We propose AuroraCap, a simple video captioner built on a large multimodal model. We follow the simplest architecture design, with no additional parameters for temporal modeling. To reduce the overhead of lengthy video sequences, we apply a token merging strategy that shrinks the number of visual tokens fed to the model. We also present VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions, and propose VDCscore, a new LLM-assisted metric for better evaluation. VDCscore adopts a divide-and-conquer strategy, transforming the evaluation of long captions into multiple short question-answering pairs.
LLaVA. AuroraCap follows the LLaVA-style architecture: a pretrained ViT visual encoder, a vision-language connector that projects visual features into the word embedding space of the LLM, and a large language model, with no extra temporal modules.
Token merging. To reduce the number of visual tokens, we apply token merging, which gradually combines the most similar tokens inside the transformer, shrinking the visual sequence without discarding tokens outright.
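For intuition, below is a minimal, unbatched sketch of the bipartite soft matching that underlies token merging (ToMe). A real implementation is batched, uses scatter operations, and tracks merged token sizes; treat this only as an illustration of the idea.

```python
import torch
import torch.nn.functional as F

def bipartite_soft_matching(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs, in the spirit of ToMe.

    x: (N, C) visual token features; returns (N - r, C) tokens.
    """
    # Split tokens into two alternating sets A and B.
    a, b = x[::2].clone(), x[1::2].clone()
    # Cosine similarity between every token in A and every token in B.
    scores = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    # Each A token's best match in B, ranked by similarity.
    best_val, best_idx = scores.max(dim=-1)
    order = best_val.argsort(descending=True)
    merged, kept = order[:r], order[r:]
    # Average each merged A token into its matched B token.
    # (A real implementation weights by accumulated token size; skipped here.)
    for i in merged:
        j = best_idx[i]
        b[j] = (b[j] + a[i]) / 2
    # Surviving A tokens plus updated B tokens: N - r tokens remain.
    return torch.cat([a[kept], b], dim=0)
```

Applying such a step between transformer layers with a fixed r per layer shrinks the visual sequence progressively with depth.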
We train AuroraCap in three stages on over 20 million high-quality image/video-text pairs. The training datasets are released on HuggingFace.
Pretraining stage. We first align visual features with the word embedding space of the LLM. To achieve this, we freeze the pretrained ViT and LLM and train only the vision-language connector.
Vision stage. We then unfreeze the pretrained ViT while keeping the LLM frozen, and train on public data spanning various computer vision tasks for better generalization.
Language stage. Finally, we conduct end-to-end training, with all components trainable, on the highest-quality public data.
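To make the schedule concrete, here is a sketch of how the per-stage freezing could be expressed in PyTorch. The attribute names `vit`, `connector`, and `llm` are placeholders rather than AuroraCap's actual module names, and we assume the connector stays trainable in the vision stage.

```python
def set_trainable(model, stage: str) -> None:
    """Freeze/unfreeze components for AuroraCap's three training stages."""
    trainable = {
        "pretraining": {"connector"},              # align vision and language
        "vision": {"vit", "connector"},            # unfreeze ViT, LLM stays frozen
        "language": {"vit", "connector", "llm"},   # end-to-end training
    }[stage]
    for name in ("vit", "connector", "llm"):       # placeholder module names
        for p in getattr(model, name).parameters():
            p.requires_grad = name in trainable
```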
Dataset | Theme | # Video | # Clip | # Caption | # Word | # Vocab. | Ave. Length |
---|---|---|---|---|---|---|---|
MSVD | Open | 1,970 | 1,970 | 70,028 | 607,339 | 13,010 | 8.67 |
MSR-VTT | Open | 7,180 | 10,000 | 200,000 | 1,856,523 | 29,316 | 9.28 |
ActivityNet | Open | 20,000 | 100,000 | 100,000 | 1,340,000 | 15,564 | 13.40 |
S-MiT | Open | 515,912 | 515,912 | 515,912 | 5,618,064 | 50,570 | 10.89 |
M-VAD | Movie | 92 | 48,986 | 55,905 | 519,933 | 18,269 | 9.30 |
MPII-MD | Movie | 94 | 68,337 | 68,375 | 653,467 | 24,549 | 9.56 |
Youcook2 | Cooking | 2,000 | 15,400 | 15,400 | 121,418 | 2,583 | 7.88 |
Charades | Human | 9,848 | 10,000 | 27,380 | 607,339 | 13,000 | 22.18 |
VATEX | Open | 41,300 | 41,300 | 413,000 | 4,994,768 | 44,103 | 12.09 |
VDC (ours) | Open | 1,027 | 1,027 | 1,027 | 515,441 | 20,419 | 500.91 |
Video collection and processing.
We build VDC upon Panda-70M, a large-scale video dataset.
Structured detailed captions construction pipeline.
We develop a structured detailed captions construction pipeline to generate detailed descriptions from various perspectives, significantly extending the length and enhancing the richness of captions compared to previous benchmarks.
The structured detailed captions include the following categories: camera caption, short caption, background caption, main object caption, and detailed caption.
To generate detailed, fine-grained, and accurate captions, we leverage GPT-4o to produce video descriptions. We design a hierarchical prompt strategy to efficiently obtain accurate structured captions and detailed captions in two conversation rounds: (1) Structured Captions Generation and (2) Detailed Captions Integration.
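The exact prompts are not reproduced here; the following sketch only illustrates the two-round structure, with hypothetical prompt wording.

```python
# Hypothetical prompt wording; only the two-round structure follows the pipeline above.
ROUND_1 = ("Describe the video from each perspective separately: camera motion, "
           "a short summary, the background, and the main objects.")
ROUND_2 = ("Now integrate the structured captions above into one coherent, "
           "detailed caption covering all perspectives.")

def build_conversation(structured_reply: str) -> list[dict]:
    """Round 1 yields structured captions; round 2 integrates them."""
    return [
        {"role": "user", "content": ROUND_1},
        {"role": "assistant", "content": structured_reply},
        {"role": "user", "content": ROUND_2},
    ]
```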
We introduce VDCscore, a novel quantitative metric that uses LLMs to evaluate the similarity between predicted and ground-truth detailed captions through a divide-and-conquer approach.
The core idea of VDCscore is to decompose long detailed captions into multiple short question-answering pairs and average the evaluation of each pair as the final result.
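A minimal sketch of this pipeline is below. The prompt wording and the binary correct/incorrect matching are simplifications we assume for illustration; `llm` stands for any chat-model call (e.g., GPT-4o).

```python
# Simplified sketch of VDCscore's divide-and-conquer evaluation.
# `llm(prompt: str) -> str` is a placeholder for an LLM API call.

def vdcscore(gt_caption: str, pred_caption: str, llm) -> float:
    # 1) Decompose the ground-truth detailed caption into short QA pairs.
    qa_text = llm("Write question-answer pairs, one per line as "
                  f"'Q: ... | A: ...', covering this caption:\n{gt_caption}")
    qa_pairs = [line.split(" | A: ")
                for line in qa_text.splitlines() if " | A: " in line]

    # 2) Answer each question using only the predicted caption.
    # 3) Let the LLM judge whether the answers agree; average over pairs.
    correct = 0
    for question, gt_answer in qa_pairs:
        pred_answer = llm(f"Answer using only this caption:\n{pred_caption}\n{question}")
        verdict = llm("Do these two answers agree? Reply yes or no.\n"
                      f"Question: {question}\nA1: {gt_answer}\nA2: {pred_answer}")
        correct += verdict.strip().lower().startswith("yes")
    return correct / max(len(qa_pairs), 1)
```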
Benchmarking video detailed captioning.
AuroraCap achieves superior performance in video detailed captioning while using significantly fewer visual tokens than other models, highlighting its efficiency.
Model | Ave. VDC | Camera | Short | BG | Object | Detailed |
---|---|---|---|---|---|---|
Gemini-1.5 Pro | 41.73 | 38.68 | 35.71 | 43.84 | 47.32 | 43.11 |
AuroraCap-7B | 38.21 | 43.50 | 32.07 | 35.92 | 39.02 | 41.30 |
InternVL-2-8B | 37.72 | 39.08 | 33.02 | 37.47 | 44.16 | 34.89 |
LLaVA-OV-7B | 37.45 | 37.82 | 32.58 | 37.43 | 38.21 | 41.20 |
ShareGPT4Video-8B | 36.17 | 33.28 | 39.05 | 35.77 | 37.12 | 35.62 |
LLaVA-1.6-13B | 35.85 | 35.61 | 31.90 | 38.90 | 36.65 | 36.18 |
LLaVA-1.6-7B | 35.70 | 36.50 | 31.91 | 37.58 | 36.03 | 36.47 |
LLaVA-NeXT-Video-7B | 35.46 | 39.73 | 30.63 | 36.54 | 36.54 | 33.84 |
LLaVA-1.5-13B | 34.78 | 38.97 | 30.89 | 34.79 | 36.27 | 33.00 |
LongVA-7B | 34.50 | 35.32 | 31.94 | 36.39 | 40.95 | 27.91 |
LLaVA-1.5-7B | 33.98 | 38.38 | 28.61 | 34.86 | 34.62 | 33.43 |
Video-LLaVA-7B | 32.80 | 37.48 | 30.67 | 32.50 | 36.01 | 27.36 |
VILA-7B | 32.61 | 34.33 | 30.40 | 35.15 | 33.38 | 29.78 |
MovieChat-7B | 31.92 | 37.25 | 32.55 | 28.99 | 31.97 | 28.82 |
Video-ChatGPT-7B | 31.12 | 37.46 | 29.36 | 33.68 | 30.47 | 24.61 |
LLaMA-VID | 30.86 | 39.47 | 29.92 | 28.01 | 31.24 | 25.67 |
Vicuna-v1.5-7B | 22.50 | 21.68 | 23.06 | 22.02 | 22.64 | 23.09 |
Llama-3.1-8B | 18.98 | 17.83 | 17.90 | 19.52 | 19.57 | 20.10 |
We evaluate AuroraCap using the CIDEr, BLEU-1, BLEU-4, METEOR, and ROUGE-L metrics on the Flickr, NoCaps, and COCO-Cap image captioning benchmarks.
Model | Flickr C | Flickr R | NoCaps C | NoCaps R | COCO-Cap C | COCO-Cap R |
---|---|---|---|---|---|---|
LLaVA-1.5-7B | 74.9 | 52.8 | 105.5 | 59.4 | 110.3 | 55.5 |
LLaVA-1.5-13B | 79.4 | 53.9 | 109.2 | 60.3 | 115.6 | 56.5 |
LLaVA-1.6-7B | 68.4 | 50.3 | 88.4 | 54.6 | 99.9 | 52.4 |
LLaVA-1.6-13B | 66.6 | 48.8 | 88.1 | 54.9 | 101.8 | 52.1 |
MiniCPM-V-3B | 66.8 | 51.0 | 89.9 | 55.8 | 94.2 | 52.3 |
DeCap | 56.7 | — | 42.7 | — | 91.2 | — |
Flamingo-80B | 67.2 | — | — | — | 84.3 | — |
Chameleon-34B | 74.72 | — | — | — | 120.22 | — |
GPT-4V | 55.38 | — | — | — | 78.58 | — |
Gemini-1.5 Pro | 82.24 | — | — | — | 99.82 | — |
AuroraCap-7B | 88.9 | 55.4 | 111.4 | 60.6 | 120.8 | 57.2 |

(C: CIDEr, R: ROUGE-L.)
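For reference, these standard captioning metrics can be computed with the pycocoevalcap package (our choice of tooling, not necessarily the one used here; any COCO-caption evaluation setup works). Note that METEOR additionally requires a Java runtime.

```python
# pip install pycocoevalcap
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

gts = {"0": ["a man is slicing vegetables in a kitchen"]}      # reference captions
res = {"0": ["a person chops vegetables on a cutting board"]}  # model prediction

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # BLEU returns a list of BLEU-1..4 scores
```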
Video captioning.
Although current video captioning benchmarks contain only one-sentence captions, we also evaluate on them to compare with prior work.
We evaluate AuroraCap on MSR-VTT and VATEX.
Model | MSR-VTT C | MSR-VTT B@1 | MSR-VTT B@4 | MSR-VTT M | MSR-VTT R | VATEX C | VATEX B@1 | VATEX B@4 | VATEX M | VATEX R |
---|---|---|---|---|---|---|---|---|---|---|
ZeroCap | 9.6 | — | 2.9 | 16.3 | 35.4 | — | — | — | — | — |
DeCap | 18.6 | — | 14.7 | 20.4 | — | 18.7 | — | 13.1 | 15.3 | — |
PaLI-3 | 21.3 | — | — | — | — | — | — | — | — | — |
Ma et al. | 22.1 | — | 3.5 | 17.3 | 28.7 | 23.9 | — | 2.8 | 14.1 | 23.5 |
LLaVA-7B | 16.9 | — | — | — | — | — | — | — | — | — |
Video-LLaMA | 2.3 | — | 4.9 | 16.8 | — | 3.8 | — | 4.3 | 16.3 | 21.8 |
AuroraCap-7B | 33.1 | 58.6 | 21.0 | 23.9 | 49.5 | 33.8 | 57.1 | 18.4 | 19.0 | 40.8 |

(C: CIDEr, B@n: BLEU-n, M: METEOR, R: ROUGE-L.)
As a core training and inference strategy of AuroraCap, token merging plays a significant role in reducing the number of visual tokens. We further study how video detailed captioning capability varies with the token merging ratio.
We define the performance percentage as a score's normalized position between the lowest and highest values on the entire performance curve. We highlight the token merging ratios at which 90% and 80% of performance is retained with dashed lines and filled areas. We find that token merging substantially reduces the number of tokens while causing minimal performance drop, and even improves performance on some tasks.
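Under our reading of this definition, performance percentage is a min-max normalization of the curve; the helper below makes the 90%/80% thresholds concrete. The function names are ours.

```python
import numpy as np

def performance_percentage(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize a performance curve: lowest point -> 0, highest -> 1."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / max(hi - lo, 1e-12)

def max_ratio_keeping(ratios, scores, threshold=0.9):
    """Most aggressive token merging ratio that still retains >= threshold
    of the normalized performance (e.g., threshold=0.9 for the 90% line)."""
    pct = performance_percentage(np.asarray(scores, dtype=float))
    kept = [r for r, p in zip(ratios, pct) if p >= threshold]
    return max(kept) if kept else None
```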
To assess inference speed, we use the inference time per video question-answering pair in seconds (TPV) as the metric. The figure below shows the minimum TPV achievable in our settings, with and without token merging and SGLang, across seven video understanding datasets. Reducing visual tokens and using SGLang yields excellent inference times per video question-answering pair, given that all these datasets have short video and question inputs.
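TPV itself is straightforward to measure; a sketch follows, with `model.generate` as a placeholder for the actual inference call.

```python
import time

def measure_tpv(model, pairs) -> float:
    """Average inference time per video question-answering pair, in seconds."""
    start = time.perf_counter()
    for video, question in pairs:
        model.generate(video, question)  # placeholder inference call
    return (time.perf_counter() - start) / len(pairs)
```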
We perform an extensive case study of AuroraCap on a variety of videos for video detailed captioning. As shown below, AuroraCap provides excellent detailed captions of the camera motion, background, and main objects, with less hallucination.
@article{auroracap,
  title={AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},
  author={Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jeng-Neng Hwang and Saining Xie and Christopher D. Manning},
  year={2024},
  journal={arXiv preprint arXiv:2410.03051},
}