
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

A more efficient multimodal large language model series.

1University of Washington, 2Pika Lab, 3Stanford University, 4Harvard University, 5New York University

Leaderboard


🤩Welcome! Submit your scores now and watch the leaderboard refresh with your achievements!


Please remember to report your frame rate and tokens per frame with each submission.

Email us at or .


We present a quantitative comparison between AuroraCap and existing state-of-the-art large multimodal models across the various sections of the structured captions in VDC. # F denotes the number of frames sampled from the input video, and TPF the number of visual tokens per frame. The average number of key frames in VDC is 10.


VDC Example


[short caption] (26 words) In this video, two smartphones are compared side by side as they launch and run the game 'Angry Birds 2', showcasing their performance and loading times.
[background caption] (65 words) The video is set against a clean, white background that emphasizes the two smartphones placed side by side. The lighting is bright and even, creating a neutral atmosphere. The time displayed at the bottom of the screen counts up, indicating the duration of the gameplay. The environment is quiet, with no background noise, allowing the focus to remain on the visual performance of the devices.
[main object caption] (85 words) The main subjects are two smartphones, one on the left and one on the right, both displaying the game 'Angry Birds 2'. The left phone shows a vibrant loading screen with colorful graphics, while the right phone initially displays a blank screen before transitioning to the game. The hands of the presenter are visible, interacting with the devices, tapping the screens to initiate the game. The presenter's fingers move swiftly, indicating a sense of urgency and excitement as they compare the responsiveness of both devices.
[camera caption] (76 words) The camera work is steady and focused, primarily using a medium shot that captures both smartphones in their entirety. The angle is slightly above the devices, providing a clear view of the screens and the presenter's hands. There are no significant camera movements; instead, the focus remains on the devices as they load the game. The video includes a timer overlay at the bottom, which counts the seconds, enhancing the comparative aspect of the performance test.
[detailed caption] (609 words) The video opens with a visually striking presentation of two smartphones positioned side by side against a pristine white background, which serves to enhance the focus on the devices themselves. The lighting is bright and evenly distributed, creating a neutral and distraction-free atmosphere that allows viewers to concentrate on the smartphones and their performance. Both devices prominently display the YouTube application interface, featuring a video titled "Dart Moon Collision," which is highlighted as "#1 ON TRENDING" from the "NASA.gov Video" channel. Below the video pane, viewers can see the view counts and like/dislike statistics, along with a curated list of additional video suggestions at the bottom of the screens. A consistent timestamp of 6:12 PM is visible, adding a temporal context to the scene. A watermark logo is also present, likely indicating the content's publisher or owner, subtly reinforcing the source of the material being showcased.
[short caption] (25 words) A young woman with curly hair engages with her laptop, transitioning from a focused expression to a smile, reflecting a journey of discovery or realization.
[background caption] (70 words) The setting is minimalistic, featuring a soft blue backdrop that creates a calm and serene atmosphere. The lighting is bright and even, enhancing the subject's features and the smooth surface of the laptop. There are no distracting elements in the background, allowing the viewer to focus entirely on the subject's expressions and actions. The environment is quiet, with a sense of stillness that emphasizes the woman's concentration and eventual joy.
[main object caption] (93 words) The main subject, a young woman with a voluminous curly hairstyle, is dressed in a light blue button-up shirt. Initially, her expression is serious and contemplative as she gazes intently at the laptop screen, her brow slightly furrowed. As she interacts with the device, her posture is slightly hunched forward, indicating focus and engagement. Gradually, her expression shifts to one of delight, with a smile breaking across her face, suggesting a positive revelation or achievement. Her fingers move deftly over the laptop's keyboard, showcasing her active participation in whatever task she is undertaking.
[camera caption] (94 words) The camera work is steady and focused, primarily using a medium shot that captures both smartphones in their entirety. The angle is slightly above the devices, providing a clear view of the screens and the presenter's hands. There are no significant camera movements; instead, the focus remains on the devices as they load the game. The video includes a timer overlay at the bottom, which counts the seconds, enhancing the comparative aspect of the performance test.
[detailed caption] (407 words) The video presents a captivating scene featuring a young woman with voluminous, curly hair, elegantly styled and framing her face. She is dressed in a light blue collared shirt that complements her complexion and adds a touch of calmness to the overall aesthetic. The setting is minimalistic, with a soft blue backdrop that enhances the serene atmosphere, creating a tranquil environment conducive to concentration. The lighting is bright and even, illuminating her features and the sleek surface of the laptop she is holding, while ensuring that no distracting elements are present in the background. This simplicity allows the viewer to focus entirely on her expressions and actions, which are central to the narrative.
As the video unfolds, the woman maintains a consistent posture, initially appearing serious and contemplative as she gazes intently at the laptop screen. Her brow is slightly furrowed, indicating deep thought and engagement with the task at hand. She leans slightly forward, her body language reflecting her focus and determination. The camera captures her in a series of close-up shots, providing an intimate view of her facial expressions and hand movements. The angles are primarily frontal, allowing the audience to connect with her emotional journey as she interacts with the device.
Throughout the sequence, the woman's expression transitions from one of concentration to a radiant smile, suggesting a journey of discovery or realization. As she navigates through the content on her laptop, her fingers move deftly over the keyboard, showcasing her active participation and engagement with whatever task she is undertaking. The smooth transitions between shots maintain a fluid narrative flow, with the focus subtly shifting between her face and the laptop screen. This technique emphasizes the connection between her emotional responses and the content she is interacting with, drawing the viewer deeper into her experience.
The overall composition of the video is clean and visually appealing, with a shallow depth of field that blurs the background slightly, ensuring that the viewer's attention remains on the subject. The stillness of the environment enhances the sense of concentration, while the eventual shift in her expression to one of delight signifies a positive revelation or achievement. This moment of joy, captured in her smile, serves as a poignant reminder of the satisfaction that can come from engaging with technology and the discoveries it can facilitate. The video encapsulates a moment of introspection and triumph, inviting the viewer to share in the woman's journey as she navigates her digital landscape.

Abstract

Baseline: Video detailed captioning is a key task that aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner built on a large multimodal model. We follow the simplest architecture design, without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we apply a token merging strategy that reduces the number of input visual tokens. Surprisingly, we find that this strategy causes little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2).

Benchmark and Metric: However, existing video caption benchmarks only include simple descriptions, consisting of a few dozen words, which limits research in this field. Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose a new LLM-assisted metric, VDCscore, for better evaluation, which adopts a divide-and-conquer strategy to transform long caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this benchmark better correlates with human judgments of video detailed captioning quality.

AuroraCap: An Efficient and Performant Video Detailed Captioner


Architecture

LLaVA. To effectively leverage the capabilities of both the pre-trained LLM and the visual model, LLaVA adopts a simple multilayer perceptron (MLP) projection layer that maps each patch token of the image features into the word embedding space.
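As a rough sketch of this design (a minimal illustration, not AuroraCap's released code; the hidden sizes and module names are assumptions), the projector is just a small MLP applied independently to every patch token:

    import torch
    import torch.nn as nn

    class VisionLanguageProjector(nn.Module):
        """Two-layer MLP mapping ViT patch features into the LLM embedding space.

        A minimal sketch in the spirit of LLaVA; the dimensions below are
        illustrative, not AuroraCap's actual configuration.
        """

        def __init__(self, vision_dim: int = 1280, llm_dim: int = 4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
            # patch_tokens: (batch, num_patches, vision_dim) from the ViT.
            # Returns (batch, num_patches, llm_dim), ready to be concatenated
            # with the word embeddings as input to the LLM.
            return self.proj(patch_tokens)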

Token merging. To increase the throughput of existing ViT models, Token Merging was proposed to gradually combine similar tokens within the transformer, reducing the number of tokens passing through the ViT. Token Merging has been shown to be effective on image and video classification tasks even without additional training. We apply frame-wise token merging in AuroraCap, with features extracted by a CLIP ViT-H model. We show token merging visualization examples from COCO, VG, and SA-1B below:

Token merging visualization. From left to right, the numbers of visual tokens representing the images are 490, 154, 18, and 6.
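For intuition, a simplified sketch of the bipartite soft matching step behind Token Merging is given below; it illustrates the general technique and is not AuroraCap's actual implementation (the alternating split, matching, and averaging details are illustrative).

    import torch

    def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
        """Bipartite soft matching in the spirit of Token Merging (ToMe).

        x: (num_tokens, dim) frame-wise visual tokens from the ViT.
        r: number of tokens to remove by merging.
        Illustrative sketch only, not AuroraCap's exact implementation.
        """
        a, b = x[::2], x[1::2]                       # split tokens into two alternating sets
        a_n = a / a.norm(dim=-1, keepdim=True)
        b_n = b / b.norm(dim=-1, keepdim=True)
        sim = a_n @ b_n.T                            # cosine similarity between the two sets

        best_sim, best_idx = sim.max(dim=-1)         # best match in b for every token in a
        order = best_sim.argsort(descending=True)
        merged, kept = order[:r], order[r:]          # merge the r most similar tokens in a

        b = b.clone()
        for i in merged.tolist():                    # fold each merged token into its match
            j = best_idx[i].item()
            b[j] = (b[j] + a[i]) / 2

        return torch.cat([a[kept], b], dim=0)        # n tokens in, n - r tokens out

In AuroraCap this kind of merging is applied frame by frame to the CLIP ViT-H patch tokens, so the number of visual tokens handed to the LLM shrinks before any language processing happens.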


Training Recipe

We use over 20 million high-quality image/video-text pairs to train AuroraCap in three stages. The training datasets are released on HuggingFace.

Pretraining stage. We first align visual features with the word embedding space of LLMs. To achieve this, we freeze the pretrained ViT and LLM, training solely the vision-language connector.

Vision stage. During the vision stage, we unfreeze the pretrained ViT while keeping the LLM frozen, and train on public data from a variety of computer vision tasks for better generalization.

Language stage. Finally, during the language stage, we conduct end-to-end training, with all components trainable, on the highest-quality public data.
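A schematic of the freezing schedule across the three stages is sketched below; the attribute names (vision_tower, projector, llm) and keeping the projector trainable in the vision stage are assumptions for illustration, not the released training configuration.

    # Illustrative freezing schedule for the three training stages; the keys
    # and module names are assumptions, not the released training configs.
    TRAINING_STAGES = {
        "pretraining": {"train_vit": False, "train_projector": True, "train_llm": False},
        "vision":      {"train_vit": True,  "train_projector": True, "train_llm": False},
        "language":    {"train_vit": True,  "train_projector": True, "train_llm": True},
    }

    def apply_stage(model, stage: str) -> None:
        """Freeze or unfreeze each component according to the stage recipe."""
        cfg = TRAINING_STAGES[stage]
        for p in model.vision_tower.parameters():   # hypothetical attribute names
            p.requires_grad = cfg["train_vit"]
        for p in model.projector.parameters():
            p.requires_grad = cfg["train_projector"]
        for p in model.llm.parameters():
            p.requires_grad = cfg["train_llm"]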

VDC: A New Video Detailed Captioning Benchmark


Benchmark Collection and Processing

Video collection and processing. We build VDC upon Panda-70M, Ego4D, Mixkit, Pixabay, and Pexels. We first split each video into clips and apply dense frame extraction, then manually replace blurry frames with adjacent clear ones.
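A rough sketch of the dense frame extraction step is given below; the sampling interval and the Laplacian sharpness heuristic are illustrative stand-ins, since in VDC blurry frames are actually replaced manually with adjacent clear ones.

    import cv2

    def extract_dense_frames(video_path: str, every_n: int = 5, blur_threshold: float = 100.0):
        """Densely sample frames from a clip and flag candidates that look blurry.

        The interval and threshold are illustrative; VDC replaces blurry frames manually.
        """
        cap = cv2.VideoCapture(video_path)
        frames, is_blurry = [], []
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # low variance => likely blurry
                frames.append(frame)
                is_blurry.append(sharpness < blur_threshold)
            idx += 1
        cap.release()
        return frames, is_blurry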

Table 1: Benchmark comparison for the video captioning task. Ave. Length indicates the average number of words per caption.

Dataset       Theme     # Video    # Clip     # Caption   # Word      # Vocab.   Ave. Length
MSVD          Open      1,970      1,970      70,028      607,339     13,010     8.67
MSR-VTT       Open      7,180      10,000     200,000     1,856,523   29,316     9.28
ActivityNet   Open      20,000     100,000    100,000     1,340,000   15,564     13.40
S-MiT         Open      515,912    515,912    515,912     5,618,064   50,570     10.89
M-VAD         Movie     92         48,986     55,905      519,933     18,269     9.30
MPII-MD       Movie     94         68,337     68,375      653,467     24,549     9.56
YouCook2      Cooking   2,000      15,400     15,400      121,418     2,583      7.88
Charades      Human     9,848      10,000     27,380      607,339     13,000     22.18
VATEX         Open      41,300     41,300     413,000     4,994,768   44,103     12.09
VDC (ours)    Open      1,027      1,027      1,027       515,441     20,419     500.91

Structured detailed captions construction pipeline. We develop a structured detailed caption construction pipeline to generate extra detailed descriptions from various perspectives, significantly extending the length and enhancing the richness compared to previous benchmarks. The structured captions include camera, short, background, main object, and detailed captions:

  1. Camera caption. Describe the camera work in detail, including shot types, angles, movements, transitions, and any special effects used to enhance the video.
  2. Short caption. Summarize the video in one detailed sentence, capturing key actions and the overall mood.
  3. Background caption. Provide a detailed description of the background, including objects, location, weather, time, and any dynamic elements.
  4. Main Object caption. Give a thorough description of the main subject's actions, attributes, interactions, and movements throughout the video frames.
  5. Detailed caption. Generate a detailed, vivid caption for the video, covering all categories, ensuring it's engaging, informative, and rich enough for AI to recreate the video content.

To generate detailed, fine-grained, and accurate captions, we leverage GPT-4o to produce video descriptions. We design a hierarchical prompt strategy to efficiently obtain accurate structured captions and detailed captions in two conversation rounds: (1) Structured Captions Generation and (2) Detailed Captions Integration.
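A minimal sketch of such a two-round conversation with the OpenAI Python client is shown below; the prompts, system message, and frame handling are illustrative placeholders, not the exact prompts used to build VDC.

    from openai import OpenAI

    client = OpenAI()

    # Round 1: structured captions (camera, short, background, main object).
    messages = [
        {"role": "system", "content": "You are a careful video captioning assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": "Describe the camera work, give a one-sentence summary, "
                                     "describe the background, and describe the main object."},
            # The sampled video frames would be attached here as image_url entries.
        ]},
    ]
    round1 = client.chat.completions.create(model="gpt-4o", messages=messages)
    structured_captions = round1.choices[0].message.content

    # Round 2: integrate the structured captions into one detailed caption.
    messages += [
        {"role": "assistant", "content": structured_captions},
        {"role": "user", "content": "Integrate the captions above into a single detailed, "
                                    "vivid caption covering all categories."},
    ]
    round2 = client.chat.completions.create(model="gpt-4o", messages=messages)
    detailed_caption = round2.choices[0].message.content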


Distribution of the video length and structured caption length in VDC.

VDCscore: Evaluating Detailed Captions with LLMs

We introduce VDCscore, a novel quantitative metric that uses LLMs to evaluate the similarity between predicted and ground-truth detailed captions through a divide-and-conquer approach. The core idea of VDCscore is to decompose long detailed captions into multiple short question-answer pairs and average the evaluation of each pair to obtain the final result.

VDCscore evaluation pipeline.
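The pipeline can be sketched roughly as follows; the llm callable, the prompts, and the answer parsing are hypothetical simplifications, not the exact evaluation code.

    def vdcscore(gt_caption: str, pred_caption: str, llm) -> float:
        """Divide-and-conquer caption evaluation, sketched.

        `llm` is a hypothetical callable mapping a prompt string to a text response;
        the prompts below are illustrative, not the exact ones used by VDCscore.
        """
        # 1. Decompose the ground-truth caption into short question-answer pairs.
        qa_text = llm(
            "Decompose the following caption into short question-answer pairs, "
            "one 'Q: ... A: ...' pair per line.\n" + gt_caption
        )
        qa_pairs = [l for l in qa_text.splitlines() if l.startswith("Q:") and "A:" in l]

        correct = 0
        for qa in qa_pairs:
            question, reference = (part.strip() for part in qa[2:].split("A:", 1))
            # 2. Answer the question using only the predicted caption.
            answer = llm(f"Using only this caption:\n{pred_caption}\nAnswer: {question}")
            # 3. Let the LLM judge whether the answer matches the reference.
            verdict = llm(f"Reference: {reference}\nAnswer: {answer}\nDo they match? Reply yes or no.")
            correct += verdict.strip().lower().startswith("yes")

        # 4. Average the per-pair judgments as the final score.
        return correct / max(len(qa_pairs), 1)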

Evaluation


Benchmarking video detailed captioning.

AuroraCap achieves superior performance in video detailed captioning while using significantly fewer visual tokens than other models, highlighting its efficiency.

Comparison of various models with different numbers of input visual tokens on VDC.

Ablation Study

As a core training and inference strategy of AuroraCap, token merging plays a significant role in reducing the number of visual tokens. We further study how video detailed captioning capability is influenced by the token merging ratio.

Visualization of token merging ratio on various image and video understanding tasks. The solid line indicates the average performance across various tasks, and the shaded area represents performance variability.

We define the performance percentage as the proportion between the highest and lowest values on the entire performance curve. We highlight the token merging ratios at which 90% and 80% performance are achieved with dashed lines and filled areas. We find that token merging significantly reduces the number of tokens while causing only a minimal performance drop, and even improves performance on some tasks.

Ablation study of the token merging ratio on various image and video understanding tasks.

To assess inference speed, we use the inference time per video question-answering pair in seconds (TPV) as the evaluation metric. The figure below shows the minimum TPV achievable in our settings, with or without token merging and SGLang, across seven video understanding datasets. Reducing the number of visual tokens and using SGLang result in excellent inference times per video question-answering pair across all the datasets, which have short video and question inputs.

Comparison between different inference settings: A: Rvtk = 1.0, without SGLang, B: Rvtk = 0.1, without SGLang, C: Rvtk = 1.0, with SGLang, D: Rvtk = 0.1, with SGLang. The number indicates the maximum inference time in seconds for each benchmark.

Case Study

We perform an extensive case study of AuroraCap on a variety of videos for video detailed captioning. As shown in the following examples, AuroraCap provides excellent detailed captions of the camera motion, background, and main objects, with fewer hallucinations.

BibTeX

    @article{auroracap,
        title={AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},
        author={Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jeng-Neng Hwang and Saining Xie and Christopher D. Manning},
        year={2024},
        journal={arXiv preprint arXiv:2410.03051},
    }