MovieChat

Enxin Song^1,∗, Wenhao Chai^2,∗,†, Guanhong Wang^1,∗, Yucheng Zhang^1,‡,
Haoyang Zhou^1,‡, Feiyang Wu^1,‡, Haozhe Chi¹, Xun Guo^3,✉,
Tian Ye⁴, Yanting Zhang⁵, Yan Lu³, Jenq-Neng Hwang², Gaoang Wang^1,✉

¹Zhejiang Unversity ²University of Washington ³Microsoft Research Asia
⁴Hong Kong University of Science and Technology (GZ) ⁵Donghua University

^∗Equal contribution ^†Project lead ^‡Data collection ^✉Corresponding author

Overview

Demo Video

Comparison Case

Breakpoint Mode

What is Chief Bogo doing?

In the video, the character Chief Bogo is seen talking to someone on a laptop while sitting at a table in a dark room. Chief Bogo is a cartoon character wearing a brown outfit.

Who is this rabbit named Judy talking to?

In the video, the rabbit police named Judy is talking to the person in a movie theater, who is a fox.

What is she doing?

In the given video, we see a young woman walking down a street in the city, and a young man also walks down the street behind her. The woman is wearing a red and black outfit.

Where is this couple?

This couple is standing in the middle of a city street with a black hat on the man's. The scene is shot in a park, and there are a few trees and bushes visible.

Global Mode


"Can you describe this video?"
"The video shows various scenes of an animation with different animals in a restaurant or office environment. At the beginning, we see an animated cartoon rabbit and a deer talking to each other in a restaurant. Next, there is a cartoon bunny talking to a cat at a desk, followed by a scene of a cartoon deer and a mouse talking to each other. The video then continues with the animated cartoon dog and cat sitting at a table in a restaurant, and another scene where a man and a dog are in a room with a computer on a table. There are more scenes of the same characters later on, including one with a bunny and a fox sitting at a desk, and another with a bunny and a dog talking to each other in the background. Finally, there are several more scenes with different animals, including a fox, a mouse, and a cat, all of which are in a different environment."

"Can you describe this video?"

"The video shows various scenes of an animation with different animals in a restaurant or office environment. At the beginning, we see an animated cartoon rabbit and a deer talking to each other in a restaurant. Next, there is a cartoon bunny talking to a cat at a desk, followed by a scene of a cartoon deer and a mouse talking to each other. The video then continues with the animated cartoon dog and cat sitting at a table in a restaurant, and another scene where a man and a dog are in a room with a computer on a table. There are more scenes of the same characters later on, including one with a bunny and a fox sitting at a desk, and another with a bunny and a dog talking to each other in the background. Finally, there are several more scenes with different animals, including a fox, a mouse, and a cat, all of which are in a different environment."


"Can you tell me what happened in this movie?"
"Certainly! In the video, you can see a group of people standing in a park with a green hill in the background. There are also people walking on the street and in a courtyard with trees. Some of the people are holding red umbrellas, and there is a woman wearing a red coat. She is then seen walking down a street at night with a man following her. Next, we see a couple standing on a balcony overlooking a street in the evening with a man walking by and then walking down the street. Towards the end of the video, a group of people are seen walking through a snowy town."

"Can you tell me what happened in this movie?"

"Certainly! In the video, you can see a group of people standing in a park with a green hill in the background. There are also people walking on the street and in a courtyard with trees. Some of the people are holding red umbrellas, and there is a woman wearing a red coat. She is then seen walking down a street at night with a man following her. Next, we see a couple standing on a balcony overlooking a street in the evening with a man walking by and then walking down the street. Towards the end of the video, a group of people are seen walking through a snowy town."


"What plot does this movie clip include?"
"The movie clip starts with a black background and text written in white on it. Next, we see a man in a black robe with a shining light and a large white flag on the background. There are also people standing and sitting in front of a large building on a black background with text. Then, we see a man with a beard and a man in a dress, standing on the ground. We also see a man with a shirt on and a man in a blue shirt standing on the ground. Finally, we see a black background with a large black bear and a large bird on it."

"What plot does this movie clip include?"

"The movie clip starts with a black background and text written in white on it. Next, we see a man in a black robe with a shining light and a large white flag on the background. There are also people standing and sitting in front of a large building on a black background with text. Then, we see a man with a beard and a man in a dress, standing on the ground. We also see a man with a shirt on and a man in a blue shirt standing on the ground. Finally, we see a black background with a large black bear and a large bird on it."


"Which movie sequence lasts the longest?"
"The video shows the scene of a dark night with a football field and a group of people playing football."

Benchmark: MovieChat-1K

To better evaluate the performance of MovieChat, we collect a new benchmark for long video understanding tasks, MovieChat-1K, which contains 1K high quality video clips sourced from various movies and TV series with 14K manual annotations.

To the best of our knowledge, a long video understanding dataset has not yet been established. Our work represents the initial step in creating and making it publicly available.We create MovieChat1K, containing 1k long videos and corresponding 1k dense captions, and 13k visual question-answer pairs.For each video, we manually set and provide 1 dense caption for the whole video, 3 question-answering pairs for global mode and 10 question-answering pairs with timestamps for breakpoint mode.

We collect videos from 15 popular categories with varying distribution, including documentary film, detective film, animation film, and so on. Among these, each video comprises multiple alternating scenes, contributing to a diverse and dynamic visual narrative within the context of the collection. Over 90% of the videos exhibit a duration ranging from 10K to 12K frames, while 14.6% of videos extending beyond 12K frames. Only 8.6% of videos have duration less than 10k frames.

Question-answering Pairs

Word Distribution

Note that MovieChat-1K is specifically designed for long video comprehension tasks, the majority of questions are open-ended, with only a quarter classified as multiple-choice questions, marked by initiators such as ‘Do,’ ‘Does,’ ‘Is,’ or ‘Are.’ We also compute the word distributions of our provided question-answer pairs, which includes common objects (people, clothes, etc.), time (day, night, etc.), scenes (indoor, outdoor, etc.), and so on.

Sentence length distribution

MovieChat1K exhibits diverse lengths of question-answer pairs in the segmented clip level. Despite the distribution of questionanswer pairs varies between the global mode and breakpoint mode, the majority of questions tends to concentrate between 5-15 words in length, while the length of answers generally have fewer than 10 words.

Dense Captions

To facilitate a more detailed understanding of long videos, we provide a dense caption for each video. MovieChat-1K exhibits diverse caption lengths in the segmented clip level. Approximately two-thirds of the clips have captions with 100-149 words, while one-fifth of the clip captions have fewer than 100 words. About 11% of clips have long captions with more than 150 words.

To analyze the word distribution of our generated captions, we compute their distributions. The resulting word distribution of the captions is presented in Fig. B6, which includes common objects (man, woman, people, girl, etc.), attributes (detective, various, small, white, etc.), locations (inside, behind, south, next, etc.), scenes (room, house, building, office, etc.), actions/events (talk, enter, leave, take, etc.), and more.

In terms of actionness, MovieChat-1K captions contains nearly the same number of verbs as with the WebVid10M dataset. To evaluate this, we use the NLTK toolkit to analyze the number of verbs in captions, focusing on extracting and tagging all unique verbs. We find a total of 109,485 verbs in the WebVid10M caption dataset, while the MovieChat-1K captions contain 102,988 unique instances of verbs. While these counts may not be entirely accurate due to our simple counting method, we believe they provide a rough indication of the actionness of the two datasets.

Bibtex

@article{song2023moviechat,
title={MovieChat: From Dense Token to Sparse Memory for Long Video Understanding},
author={Song, Enxin and Chai, Wenhao and Wang, Guanhong and Zhang, Yucheng and Zhou, Haoyang and Wu, Feiyang and Guo, Xun and Ye, Tian and Lu, Yan and Hwang, Jenq-Neng and others},
journal={arXiv preprint arXiv:2307.16449},
year={2023}
}

MovieChat: From Dense Token to Sparse Memory in Long Video Understanding

Abstract