About
Wenhao Chai
[CV] [Google Scholar] [GitHub] [Twitter (X)] [Hugging Face] [Instagram] [LinkedIn]
Wenhao Chai is currently a graduate student at the University of Washington, in the Information Processing Lab, advised by Prof. Jenq-Neng Hwang. Previously, he was an undergraduate student at Zhejiang University, in the CVNext Lab, advised by Prof. Gaoang Wang. He has been fortunate to intern at Pika Lab and Microsoft Research Asia.
His research focuses primarily on large multimodal models for video understanding, generative models, embodied agents, and human pose and motion.
Aurora Series
09/2024: We release the first work in the Aurora series, AuroraCap, in collaboration with Pika Lab, Stanford, MIT, Harvard, and NYU, which aims to build a new generation of image and video captioning models.
MovieChat
07/2023: We release MovieChat, the first framework that can chat with over ten thousand frames of video, accepted to CVPR 2024. We also host the LOVEU: LOng-form VidEo Understanding challenge at CVPR 2024!
STEVE Series
12/2023: We release the STEVE series, named after the protagonist of the game Minecraft, which aims to build an embodied agent based on MLLMs. Works in this series have been accepted to ECCV 2024, as well as to workshops at ICLR and CVPR.
StableVideo
07/2023: We release StableVideo, a diffusion-based framework for text-driven video editing, accepted to ICCV 2023. The project repo has gained over 1.4k stars on GitHub.
Check Out
News and Highlights
- 10/2024: We have started writing blog posts. Check them out here.
- 07/2024: Two papers accepted to ACM MM 2024.
- 07/2024: Two papers accepted to ECCV 2024.
- 06/2024: One technical report accepted to CVPR 2024 workshop @ NTIRE.
- 06/2024: We are working with Pika Lab to develop next-generation video understanding and generation models.
- 05/2024: One paper accepted to CVPR 2024 workshop @ Embodied AI.
- 04/2024: We are hosting CVPR 2024 Long-form Video Understanding Challenge @ LOVEU.
- 04/2024: Invited talk at AgentX seminar about our STEVE series works.
- 03/2024: One paper accepted to ICLR 2024 workshop @ LLM Agents.
- 02/2024: Two papers accepted to CVPR 2024 (1 highlight).
- 02/2024: Invited talk at AAAI 2024 workshop @ IMAGEOMICS.
- 12/2023: One paper accepted to ICASSP 2024.
- 12/2023: One paper accepted to AAAI 2024.
- 11/2023: Two papers accepted to WACV 2024 workshop @ CV4Smalls.
- 09/2023: One paper accepted to ICCV 2023 workshop @ TNGCV-DataComp.
- 09/2023: One paper accepted to IEEE T-MM.
- 08/2023: One paper accepted to BMVC 2023.
- 07/2023: Two papers accepted to ACM MM 2023.
- 07/2023: Two papers accepted to ICCV 2023.
Recent
Projects
* Equal contribution. † Project lead. ‡ Corresponding author.
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai*†, Enxin Song*, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning
arXiv preprint, 2024
[Website]
[Paper]
[Model]
[Benchmark]
[Dataset]
[Code]
AuroraCap is a multimodal LLM designed for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.
StableVideo: Text-driven Consistency-aware Diffusion Video Editing
Wenhao Chai, Xun Guo‡, Gaoang Wang, Yan Lu
International Conference on Computer Vision (ICCV), 2023
[Website]
[Paper]
[Video]
[Demo]
[Code]
We introduce temporal dependency to existing text-driven diffusion models, which allows them to generate a consistent appearance for the new objects.
MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
Enxin Song*, Wenhao Chai*†, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang‡
Computer Vision and Pattern Recognition (CVPR), 2024
[Website]
[Paper]
[Blog]
[Dataset]
[Code]
MovieChat achieves state-of-the-art performance in extra-long video (more than 10K frames) understanding by introducing a memory mechanism.
See and Think: Embodied Agent in Virtual Environment
Zhonghan Zhao*, Wenhao Chai*†, Xuan Wang*, Boyi Li, Shengyu Hao, Shidong Cao, Tian Ye, Jenq-Neng Hwang, Gaoang Wang‡
European Conference on Computer Vision (ECCV), 2024
[Website]
[Paper]
[Video]
[Dataset]
[Model]
STEVE, named after the protagonist of the game Minecraft, is a framework that aims to build an embodied agent based on a vision model and LLMs within an open world.