
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

A more efficient multimodal large language model series.

1University of Washington, 2Pika Lab, 3Stanford University, 4Harvard University, 5New York University

Leaderboard


🤩Welcome! Submit your scores now and watch the leaderboard refresh with your achievements!


Please remember to report your frame rate and tokens per frame with each submission.

Email us at or .


We present a quantitative comparison between AuroraCap and existing state-of-the-art large multimodal models across the various sections of the structured captions in VDC. # F denotes the number of frames sampled from the input video, and TPF the number of visual tokens per frame. The average number of key frames in VDC is 10.


VDC Example


[short caption] (26 words) In this video, two smartphones are compared side by side as they launch and run the game 'Angry Birds 2', showcasing their performance and loading times.
[background caption] (65 words) The video is set against a clean, white background that emphasizes the two smartphones placed side by side. The lighting is bright and even, creating a neutral atmosphere. The time displayed at the bottom of the screen counts up, indicating the duration of the gameplay. The environment is quiet, with no background noise, allowing the focus to remain on the visual performance of the devices.
[main object caption] (85 words) The main subjects are two smartphones, one on the left and one on the right, both displaying the game 'Angry Birds 2'. The left phone shows a vibrant loading screen with colorful graphics, while the right phone initially displays a blank screen before transitioning to the game. The hands of the presenter are visible, interacting with the devices, tapping the screens to initiate the game. The presenter's fingers move swiftly, indicating a sense of urgency and excitement as they compare the responsiveness of both devices.
[camera caption] (76 words) The camera work is steady and focused, primarily using a medium shot that captures both smartphones in their entirety. The angle is slightly above the devices, providing a clear view of the screens and the presenter's hands. There are no significant camera movements; instead, the focus remains on the devices as they load the game. The video includes a timer overlay at the bottom, which counts the seconds, enhancing the comparative aspect of the performance test.
[detailed caption] (609 words) The video opens with a visually striking presentation of two smartphones positioned side by side against a pristine white background, which serves to enhance the focus on the devices themselves. The lighting is bright and evenly distributed, creating a neutral and distraction-free atmosphere that allows viewers to concentrate on the smartphones and their performance. Both devices prominently display the YouTube application interface, featuring a video titled "Dart Moon Collision," which is highlighted as "#1 ON TRENDING" from the "NASA.gov Video" channel. Below the video pane, viewers can see the view counts and like/dislike statistics, along with a curated list of additional video suggestions at the bottom of the screens. A consistent timestamp of 6:12 PM is visible, adding a temporal context to the scene. A watermark logo is also present, likely indicating the content's publisher or owner, subtly reinforcing the source of the material being showcased.
[short caption] (25 words) A young woman with curly hair engages with her laptop, transitioning from a focused expression to a smile, reflecting a journey of discovery or realization.
[background caption] (70 words) The setting is minimalistic, featuring a soft blue backdrop that creates a calm and serene atmosphere. The lighting is bright and even, enhancing the subject's features and the smooth surface of the laptop. There are no distracting elements in the background, allowing the viewer to focus entirely on the subject's expressions and actions. The environment is quiet, with a sense of stillness that emphasizes the woman's concentration and eventual joy.
[main object caption] (93 words) The main subject, a young woman with a voluminous curly hairstyle, is dressed in a light blue button-up shirt. Initially, her expression is serious and contemplative as she gazes intently at the laptop screen, her brow slightly furrowed. As she interacts with the device, her posture is slightly hunched forward, indicating focus and engagement. Gradually, her expression shifts to one of delight, with a smile breaking across her face, suggesting a positive revelation or achievement. Her fingers move deftly over the laptop's keyboard, showcasing her active participation in whatever task she is undertaking.
[camera caption] (94 words) The camera work is steady and focused, primarily using a medium shot that captures both smartphones in their entirety. The angle is slightly above the devices, providing a clear view of the screens and the presenter's hands. There are no significant camera movements; instead, the focus remains on the devices as they load the game. The video includes a timer overlay at the bottom, which counts the seconds, enhancing the comparative aspect of the performance test.
[detailed caption] (407 words) The video presents a captivating scene featuring a young woman with voluminous, curly hair, elegantly styled and framing her face. She is dressed in a light blue collared shirt that complements her complexion and adds a touch of calmness to the overall aesthetic. The setting is minimalistic, with a soft blue backdrop that enhances the serene atmosphere, creating a tranquil environment conducive to concentration. The lighting is bright and even, illuminating her features and the sleek surface of the laptop she is holding, while ensuring that no distracting elements are present in the background. This simplicity allows the viewer to focus entirely on her expressions and actions, which are central to the narrative.
As the video unfolds, the woman maintains a consistent posture, initially appearing serious and contemplative as she gazes intently at the laptop screen. Her brow is slightly furrowed, indicating deep thought and engagement with the task at hand. She leans slightly forward, her body language reflecting her focus and determination. The camera captures her in a series of close-up shots, providing an intimate view of her facial expressions and hand movements. The angles are primarily frontal, allowing the audience to connect with her emotional journey as she interacts with the device.
Throughout the sequence, the woman's expression transitions from one of concentration to a radiant smile, suggesting a journey of discovery or realization. As she navigates through the content on her laptop, her fingers move deftly over the keyboard, showcasing her active participation and engagement with whatever task she is undertaking. The smooth transitions between shots maintain a fluid narrative flow, with the focus subtly shifting between her face and the laptop screen. This technique emphasizes the connection between her emotional responses and the content she is interacting with, drawing the viewer deeper into her experience.
The overall composition of the video is clean and visually appealing, with a shallow depth of field that blurs the background slightly, ensuring that the viewer's attention remains on the subject. The stillness of the environment enhances the sense of concentration, while the eventual shift in her expression to one of delight signifies a positive revelation or achievement. This moment of joy, captured in her smile, serves as a poignant reminder of the satisfaction that can come from engaging with technology and the discoveries it can facilitate. The video encapsulates a moment of introspection and triumph, inviting the viewer to share in the woman's journey as she navigates her digital landscape.

Abstract

Baseline: Video detailed captioning is a key task that aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner built on a large multimodal model. We follow the simplest architecture design, without additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we apply a token merging strategy that reduces the number of input visual tokens. Surprisingly, we find that this strategy causes little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2).

Benchmark and Metric: However, existing video caption benchmarks only include simple descriptions, consisting of a few dozen words, which limits research in this field. Therefore, we develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose a new LLM-assisted metric, VDCscore, for better evaluation, which adopts a divide-and-conquer strategy to transform long caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this benchmark better correlates with human judgments of video detailed captioning quality.

AuroraCap: An Efficient and Performant Video Detailed Captioner


Architecture

LLaVA. To effectively leverage the capabilities of both the pre-trained LLM and the visual model, LLaVA adopts a simple multilayer perceptron (MLP) projection layer that maps each patch token of the image features into the word embedding space.
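As a rough sketch of this design (a minimal illustration, not AuroraCap's released code; the hidden sizes and module names are assumptions), the projector is just a small MLP applied independently to every patch token:

    import torch
    import torch.nn as nn

    class VisionLanguageProjector(nn.Module):
        """Two-layer MLP mapping ViT patch features into the LLM embedding space.

        A minimal sketch in the spirit of LLaVA; the dimensions below are
        illustrative, not AuroraCap's actual configuration.
        """

        def __init__(self, vision_dim: int = 1280, llm_dim: int = 4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
            # patch_tokens: (batch, num_patches, vision_dim) from the ViT.
            # Returns (batch, num_patches, llm_dim), ready to be concatenated
            # with the word embeddings as input to the LLM.
            return self.proj(patch_tokens)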

Token merging. To increase the throughput of existing ViT models, Token Merging was proposed to gradually combine similar tokens within the transformer, reducing the number of tokens passing through the ViT. Token Merging has been shown to be effective on image and video classification tasks even without additional training. We apply frame-wise token merging in AuroraCap, with features extracted by a CLIP ViT-H model. We show token merging visualization examples from COCO, VG, and SA-1B below:

Token merging visualization. From left to right, the numbers of visual tokens representing the images are 490, 154, 18, and 6.
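For intuition, a simplified sketch of the bipartite soft matching step behind Token Merging is given below; it illustrates the general technique and is not AuroraCap's actual implementation (the alternating split, matching, and averaging details are illustrative).

    import torch

    def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
        """Bipartite soft matching in the spirit of Token Merging (ToMe).

        x: (num_tokens, dim) frame-wise visual tokens from the ViT.
        r: number of tokens to remove by merging.
        Illustrative sketch only, not AuroraCap's exact implementation.
        """
        a, b = x[::2], x[1::2]                       # split tokens into two alternating sets
        a_n = a / a.norm(dim=-1, keepdim=True)
        b_n = b / b.norm(dim=-1, keepdim=True)
        sim = a_n @ b_n.T                            # cosine similarity between the two sets

        best_sim, best_idx = sim.max(dim=-1)         # best match in b for every token in a
        order = best_sim.argsort(descending=True)
        merged, kept = order[:r], order[r:]          # merge the r most similar tokens in a

        b = b.clone()
        for i in merged.tolist():                    # fold each merged token into its match
            j = best_idx[i].item()
            b[j] = (b[j] + a[i]) / 2

        return torch.cat([a[kept], b], dim=0)        # n tokens in, n - r tokens out

In AuroraCap this kind of merging is applied frame by frame to the CLIP ViT-H patch tokens, so the number of visual tokens handed to the LLM shrinks before any language processing happens.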


Training Recipe

We use over 20 million high-quality image/video-text pairs to train AuroraCap in three stages. The training datasets are released on HuggingFace.

Pretraining stage. We first align visual features with the word embedding space of LLMs. To achieve this, we freeze the pretrained ViT and LLM, training solely the vision-language connector.

Vision stage. During the vision stage, we unfreeze the pretrained ViT while keeping the LLM frozen, and train on public data from a variety of computer vision tasks for better generalization.

Language stage. Finally, during the language stage, we conduct end-to-end training, with all components trainable, on the highest-quality public data.
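A schematic of the freezing schedule across the three stages is sketched below; the attribute names (vision_tower, projector, llm) and keeping the projector trainable in the vision stage are assumptions for illustration, not the released training configuration.

    # Illustrative freezing schedule for the three training stages; the keys
    # and module names are assumptions, not the released training configs.
    TRAINING_STAGES = {
        "pretraining": {"train_vit": False, "train_projector": True, "train_llm": False},
        "vision":      {"train_vit": True,  "train_projector": True, "train_llm": False},
        "language":    {"train_vit": True,  "train_projector": True, "train_llm": True},
    }

    def apply_stage(model, stage: str) -> None:
        """Freeze or unfreeze each component according to the stage recipe."""
        cfg = TRAINING_STAGES[stage]
        for p in model.vision_tower.parameters():   # hypothetical attribute names
            p.requires_grad = cfg["train_vit"]
        for p in model.projector.parameters():
            p.requires_grad = cfg["train_projector"]
        for p in model.llm.parameters():
            p.requires_grad = cfg["train_llm"]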

VDC: A New Video Detailed Captioning Benchmark


Benchmark Collection and Processing

Video collection and processing. We build VDC upon Panda-70M, Ego4D, Mixkit, Pixabay, and Pexels. We first split each video into clips and apply dense frame extraction, then manually replace blurry frames with adjacent clear ones.
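A rough sketch of the dense frame extraction step is given below; the sampling interval and the Laplacian sharpness heuristic are illustrative stand-ins, since in VDC blurry frames are actually replaced manually with adjacent clear ones.

    import cv2

    def extract_dense_frames(video_path: str, every_n: int = 5, blur_threshold: float = 100.0):
        """Densely sample frames from a clip and flag candidates that look blurry.

        The interval and threshold are illustrative; VDC replaces blurry frames manually.
        """
        cap = cv2.VideoCapture(video_path)
        frames, is_blurry = [], []
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()  # low variance => likely blurry
                frames.append(frame)
                is_blurry.append(sharpness < blur_threshold)
            idx += 1
        cap.release()
        return frames, is_blurry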

Table 1: Benchmark comparison for the video captioning task. Ave. Length indicates the average number of words per caption.

Dataset       Theme     # Video    # Clip     # Caption   # Word      # Vocab.   Ave. Length
MSVD          Open      1,970      1,970      70,028      607,339     13,010     8.67
MSR-VTT       Open      7,180      10,000     200,000     1,856,523   29,316     9.28
ActivityNet   Open      20,000     100,000    100,000     1,340,000   15,564     13.40
S-MiT         Open      515,912    515,912    515,912     5,618,064   50,570     10.89
M-VAD         Movie     92         48,986     55,905      519,933     18,269     9.30
MPII-MD       Movie     94         68,337     68,375      653,467     24,549     9.56
YouCook2      Cooking   2,000      15,400     15,400      121,418     2,583      7.88
Charades      Human     9,848      10,000     27,380      607,339     13,000     22.18
VATEX         Open      41,300     41,300     413,000     4,994,768   44,103     12.09
VDC (ours)    Open      1,027      1,027      1,027       515,441     20,419     500.91

Structured detailed captions construction pipeline. We develop a structured detailed caption construction pipeline to generate extra detailed descriptions from various perspectives, significantly extending the length and enhancing the richness compared to previous benchmarks. The structured captions include camera, short, background, main object, and detailed captions:

  1. Camera caption. Describe the camera work in detail, including shot types, angles, movements, transitions, and any special effects used to enhance the video.
  2. Short caption. Summarize the video in one detailed sentence, capturing key actions and the overall mood.
  3. Background caption. Provide a detailed description of the background, including objects, location, weather, time, and any dynamic elements.
  4. Main Object caption. Give a thorough description of the main subject's actions, attributes, interactions, and movements throughout the video frames.
  5. Detailed caption. Generate a detailed, vivid caption for the video, covering all categories, ensuring it's engaging, informative, and rich enough for AI to recreate the video content.

To generate detailed, fine-grained, and accurate captions, we leverage GPT-4o to produce video descriptions. We design a hierarchical prompt strategy to efficiently obtain accurate structured captions and detailed captions in two conversation rounds: (1) Structured Captions Generation and (2) Detailed Captions Integration.
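A minimal sketch of such a two-round conversation with the OpenAI Python client is shown below; the prompts, system message, and frame handling are illustrative placeholders, not the exact prompts used to build VDC.

    from openai import OpenAI

    client = OpenAI()

    # Round 1: structured captions (camera, short, background, main object).
    messages = [
        {"role": "system", "content": "You are a careful video captioning assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": "Describe the camera work, give a one-sentence summary, "
                                     "describe the background, and describe the main object."},
            # The sampled video frames would be attached here as image_url entries.
        ]},
    ]
    round1 = client.chat.completions.create(model="gpt-4o", messages=messages)
    structured_captions = round1.choices[0].message.content

    # Round 2: integrate the structured captions into one detailed caption.
    messages += [
        {"role": "assistant", "content": structured_captions},
        {"role": "user", "content": "Integrate the captions above into a single detailed, "
                                    "vivid caption covering all categories."},
    ]
    round2 = client.chat.completions.create(model="gpt-4o", messages=messages)
    detailed_caption = round2.choices[0].message.content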


Distribution of the video length and structured caption length in VDC.

VDCscore: Evaluating Detailed Captions with LLMs

We introduce VDCscore, a novel quantitative metric that uses LLMs to evaluate the similarity between predicted and ground-truth detailed captions through a divide-and-conquer approach. The core idea of VDCscore is to decompose long detailed captions into multiple short question-answer pairs and average the evaluation of each pair to obtain the final result.

VDCscore evaluation pipeline.
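The pipeline can be sketched roughly as follows; the llm callable, the prompts, and the answer parsing are hypothetical simplifications, not the exact evaluation code.

    def vdcscore(gt_caption: str, pred_caption: str, llm) -> float:
        """Divide-and-conquer caption evaluation, sketched.

        `llm` is a hypothetical callable mapping a prompt string to a text response;
        the prompts below are illustrative, not the exact ones used by VDCscore.
        """
        # 1. Decompose the ground-truth caption into short question-answer pairs.
        qa_text = llm(
            "Decompose the following caption into short question-answer pairs, "
            "one 'Q: ... A: ...' pair per line.\n" + gt_caption
        )
        qa_pairs = [l for l in qa_text.splitlines() if l.startswith("Q:") and "A:" in l]

        correct = 0
        for qa in qa_pairs:
            question, reference = (part.strip() for part in qa[2:].split("A:", 1))
            # 2. Answer the question using only the predicted caption.
            answer = llm(f"Using only this caption:\n{pred_caption}\nAnswer: {question}")
            # 3. Let the LLM judge whether the answer matches the reference.
            verdict = llm(f"Reference: {reference}\nAnswer: {answer}\nDo they match? Reply yes or no.")
            correct += verdict.strip().lower().startswith("yes")

        # 4. Average the per-pair judgments as the final score.
        return correct / max(len(qa_pairs), 1)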

Evaluation


Benchmarking video detailed captioning.

AuroraCap achieves superior performance in video detailed captioning while using significantly fewer visual tokens than other models, highlighting its efficiency.

Comparison of various models with different numbers of input visual tokens on VDC.

Ablation Study

As a core training and inference strategy of AuroraCap, token merging plays a significant role in reducing the number of visual tokens. We further study how video detailed captioning capability is influenced by the token merging ratio.

Visualization of token merging ratio on various image and video understanding tasks. The solid line indicates the average performance across various tasks, and the shaded area represents performance variability.

We define the performance percentage as the proportion between the highest and lowest values on the entire performance curve. We highlight the token merging ratios at which 90% and 80% performance are achieved with dashed lines and filled areas. We find that token merging significantly reduces the number of tokens while causing only a minimal performance drop, and even improves performance on some tasks.

Ablation study of the token merging ratio on various image and video understanding tasks.

To assess inference speed, we use the inference time per video question-answering pair in seconds (TPV) as the evaluation metric. The figure below shows the minimum TPV achievable in our settings, with or without token merging and SGLang, across seven video understanding datasets. Reducing the number of visual tokens and using SGLang result in excellent inference times per video question-answering pair across all the datasets, which have short video and question inputs.

Comparison between different inference settings: A: Rvtk = 1.0, without SGLang, B: Rvtk = 0.1, without SGLang, C: Rvtk = 1.0, with SGLang, D: Rvtk = 0.1, with SGLang. The number indicates the maximum inference time in seconds for each benchmark.

Case Study

We perform an extensive case study of AuroraCap on a variety of videos for video detailed captioning. As shown in the following examples, AuroraCap provides excellent detailed captions of the camera motion, background, and main objects, with fewer hallucinations.

BibTeX

    @article{auroracap,
        title={AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},
        author={Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jeng-Neng Hwang and Saining Xie and Christopher D. Manning},
        year={2024},
        journal={arXiv preprint arXiv:2410.03051},
    }