A more efficient multimodal large language model series.
Please remember to report your frame sampling number and visual tokens per frame with each submission.
We present a quantitative comparison between AuroraCap and existing state-of-the-art large multimodal models across the sections of structured captions in VDC. # F stands for the frame sampling number of the input video, and TPF represents the visual tokens per frame. The average key-frame number in VDC is 10.
Baseline: Video detailed captioning is a key task that aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design, without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we implement a token merging strategy that reduces the number of input visual tokens. Surprisingly, we find that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2).
Benchmark and Metric: However, existing video caption benchmarks only include simple descriptions of a few dozen words, which limits research in this field. Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose VDCscore, a new LLM-assisted metric for better evaluation, which adopts a divide-and-conquer strategy to transform long-caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this benchmark better correlates with human judgments of video detailed captioning quality.
LLaVA. AuroraCap follows the simple LLaVA-style architecture of a ViT visual encoder, a vision-language connector, and an LLM, with no additional parameters for temporal modeling.
Token merging. To address the overhead of lengthy video sequences, AuroraCap merges similar visual tokens, reducing the number of visual tokens per frame passed to the LLM with little performance loss.
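As an illustration of the idea only (not the repository's actual implementation), the sketch below merges the `r` most similar pairs of visual tokens from two alternating sets by averaging them, shrinking N tokens to N − r:

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Minimal bipartite token-merging sketch (illustrative, not AuroraCap's code).

    x: (N, C) visual tokens for one frame. Returns an (N - r, C) tensor.
    """
    a, b = x[::2], x[1::2]                       # alternating bipartite split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    best_sim, best_dst = sim.max(dim=-1)         # most similar b-token for each a-token
    order = best_sim.argsort(descending=True)
    merged, kept = order[:r], order[r:]          # a-tokens to merge away vs. keep

    # Average each merged a-token into its most similar b-token;
    # collisions are handled by accumulating sums and counts.
    b_sum = b.clone()
    counts = torch.ones(b.size(0), 1)
    b_sum.index_add_(0, best_dst[merged], a[merged])
    counts.index_add_(0, best_dst[merged], torch.ones(merged.numel(), 1))
    return torch.cat([a[kept], b_sum / counts], dim=0)
```

A reduction of this kind is what lowers the visual tokens per frame (TPF) reported in the comparison above.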
We use over 20 million high-quality image/video-text pairs to train AuroraCap in three stages. The training datasets are released on HuggingFace.
Pretraining stage. We first align visual features with the word embedding space of LLMs. To achieve this, we freeze the pretrained ViT and LLM, training solely the vision-language connector.
Vision stage. We then unfreeze the pretrained ViT while keeping the LLM frozen, training on public data from a variety of computer vision tasks for better generalization.
Language stage. Finally, we conduct end-to-end training, with all components trainable, on the highest-quality public data.
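The three stages above amount to a freeze/unfreeze schedule over the three components. A minimal sketch of such a schedule is shown below; the attribute names `vit`, `connector`, and `llm` are illustrative placeholders, not the actual module names in the codebase:

```python
# Hypothetical freeze/unfreeze schedule for the three training stages.
# Module names (vit, connector, llm) are placeholders, not the repo's API.
STAGES = {
    "pretraining": {"vit": False, "connector": True, "llm": False},
    "vision":      {"vit": True,  "connector": True, "llm": False},
    "language":    {"vit": True,  "connector": True, "llm": True},
}

def set_trainable(model, stage: str) -> None:
    """Set requires_grad on each component according to the stage table."""
    for name, trainable in STAGES[stage].items():
        for param in getattr(model, name).parameters():
            param.requires_grad = trainable
```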
Video collection and processing. We build VDC upon Panda-70M.
| Dataset | Theme | # Video | # Clip | # Caption | # Word | # Vocab. | Avg. Length |
|---|---|---|---|---|---|---|---|
| MSVD | Open | 1,970 | 1,970 | 70,028 | 607,339 | 13,010 | 8.67 |
| MSR-VTT | Open | 7,180 | 10,000 | 200,000 | 1,856,523 | 29,316 | 9.28 |
| ActivityNet | Open | 20,000 | 100,000 | 100,000 | 1,340,000 | 15,564 | 13.40 |
| S-MiT | Open | 515,912 | 515,912 | 515,912 | 5,618,064 | 50,570 | 10.89 |
| M-VAD | Movie | 92 | 48,986 | 55,905 | 519,933 | 18,269 | 9.30 |
| MPII-MD | Movie | 94 | 68,337 | 68,375 | 653,467 | 24,549 | 9.56 |
| YouCook2 | Cooking | 2,000 | 15,400 | 15,400 | 121,418 | 2,583 | 7.88 |
| Charades | Human | 9,848 | 10,000 | 27,380 | 607,339 | 13,000 | 22.18 |
| VATEX | Open | 41,300 | 41,300 | 413,000 | 4,994,768 | 44,103 | 12.09 |
| VDC (ours) | Open | 1,027 | 1,027 | 1,027 | 515,441 | 20,419 | 500.91 |
Structured detailed captions construction pipeline. We develop a structured detailed caption construction pipeline to generate extra detailed descriptions from various perspectives, significantly extending the length and enhancing the richness compared to previous benchmarks. The structured captions comprise camera, short, background, main object, and detailed captions.
To generate detailed, fine-grained, and accurate captions, we leverage GPT-4o to produce video descriptions. We design a hierarchical prompt strategy to efficiently obtain accurate structured captions and detailed captions in two conversation rounds: (1) Structured Captions Generation and (2) Detailed Captions Integration.
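For illustration, the two conversation rounds might be wired up as in the sketch below, using the OpenAI Python client; the prompt wording is a rough paraphrase, not the exact prompts used to build VDC:

```python
# Hypothetical sketch of the two-round prompting described above.
from openai import OpenAI

client = OpenAI()

def caption_video(frame_urls: list[str]) -> dict:
    frames = [{"type": "image_url", "image_url": {"url": u}} for u in frame_urls]

    # Round 1: structured captions (camera / short / background / main object).
    round1 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": frames + [{
            "type": "text",
            "text": "Describe the camera motion, give a short caption, and "
                    "describe the background and the main object separately."}]}],
    )
    structured = round1.choices[0].message.content

    # Round 2: integrate the structured captions into one detailed caption.
    round2 = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [{
            "type": "text",
            "text": "Combine the following structured captions into a single "
                    "coherent detailed caption:\n" + structured}]}],
    )
    return {"structured": structured, "detailed": round2.choices[0].message.content}
```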
We introduce VDCscore, a novel quantitative metric that utilizes LLMs to evaluate the similarity between predicted and ground-truth detailed captions through a divide-and-conquer approach. The core idea of VDCscore is to decompose long detailed captions into multiple short question-answer pairs and average the evaluation of each pair as the final result.
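A schematic sketch of this divide-and-conquer evaluation is shown below; `ask_llm` is a hypothetical helper that sends a prompt to an LLM and returns its text reply, and the prompts are illustrative rather than the exact ones used for VDCscore:

```python
def vdcscore(pred_caption: str, gt_caption: str, ask_llm, num_pairs: int = 10) -> float:
    # 1) Decompose the ground-truth caption into short question-answer pairs.
    qa_text = ask_llm(
        f"Generate {num_pairs} short question-answer pairs, one per line as "
        f"'question | answer', covering the facts in this caption:\n{gt_caption}")
    qa_pairs = [line.split("|", 1) for line in qa_text.splitlines() if "|" in line]

    # 2) Answer each question using only the predicted caption, then
    # 3) let the LLM judge whether the answers match, and average the verdicts.
    correct = 0
    for question, gt_answer in qa_pairs:
        pred_answer = ask_llm(
            f"Answer based only on this caption:\n{pred_caption}\nQ: {question}")
        verdict = ask_llm(
            f"Question: {question}\nReference answer: {gt_answer}\n"
            f"Candidate answer: {pred_answer}\nReply 'yes' if they match, else 'no'.")
        correct += verdict.strip().lower().startswith("yes")
    return correct / max(len(qa_pairs), 1)
```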
AuroraCap achieves superior performance in video detailed captioning while using significantly fewer visual tokens than other models, highlighting its efficiency.
We perform an extensive case study of AuroraCap on a variety of videos for video detailed captioning. As shown below, AuroraCap provides excellent detailed captions of the camera motion, background, and main objects with less hallucination.
```bibtex
@article{auroracap,
  title   = {AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},
  author  = {Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jeng-Neng Hwang and Saining Xie and Christopher D. Manning},
  journal = {arXiv preprint arXiv:2410.03051},
  year    = {2024},
}
```