About
Wenhao Chai
Docs (updated on 12/22/2024):
[CV]
[Research Statement]
[Slides]
Research:
[Google Scholar]
[GitHub]
Social Media:
[Twitter]
[Instagram]
[LinkedIn]
[Hugging Face]
I'm actively applying for Fall 2025 Ph.D. positions. I am currently in the United States and thus will not have any visa issues.
Wenhao Chai is currently a graduate student at the University of Washington, in the Information Processing Lab advised by Prof. Jenq-Neng Hwang. Previously, he was an undergraduate student at Zhejiang University, in the CVNext Lab advised by Prof. Gaoang Wang. He is fortunate to work with Prof. Christopher D. Manning at Stanford University, and has worked with Prof. Saining Xie and Prof. Yilun Du. He has interned at Pika Labs and Microsoft Research Asia. His research focuses primarily on large multimodal models (LMMs) for video understanding, embodied agents, and generative models. He has published related papers in top-tier conferences and journals such as CVPR, ICCV, ECCV, and AAAI. He has also organized workshops and tutorials at CVPR and AAAI, and served as a reviewer for NeurIPS, ICLR, ICML, CVPR, ECCV, AAAI, and AISTATS.
I love photography; all the photos on this website were taken by me.
VDC Challenge at CVPR 2025
09/2024: We release AuroraCap, the first work in the Aurora series, in collaboration with Pika Labs, Stanford, MIT, Harvard, and NYU, aiming to build a new generation of image and video captioning baselines and benchmarks. We also host the first VDC Challenge at CVPR 2025.
MovieChat
07/2023: We release MovieChat, the first framework that can chat over videos of more than 10k frames, accepted to CVPR 2024 and selected as a highlight paper by Paper Digest. We also host the LOVEU: LOng-form VidEo Understanding challenge at CVPR 2024!
SAMURAI
11/2024: We release SAMURAI, a substantial improvement over SAM 2 on the visual object tracking task, which gained over one million views across social media and over 6,000 GitHub stars within one week.
StableVideo
07/2023: We release StableVideo, a diffusion-based framework for text-driven video editing, accepted to ICCV 2023. The project repo has gained over 1,400 stars on GitHub.
Check Out
News and Highlights
- 04/2025: We are hosting CVPR 2025 Video Understanding Challenge @ LOVEU.
- 12/2024: Two papers accepted to AAAI 2025.
- 10/2024: First blog post released, exploring decoding auto-regressive LLMs in a diffusion-like manner.
- 07/2024: Two papers accepted to ECCV 2024.
- 06/2024: One technical report accepted to CVPR 2024 workshop @ NTIRE.
- 06/2024: MovieChat is selected as a highlight paper (rank 68) of CVPR 2024 by Paper Digest.
- 06/2024: I am working at Pika Labs as an intern to develop next-generation video understanding and generation models.
- 05/2024: One paper accepted to CVPR 2024 workshop @ Embodied AI.
- 04/2024: We are hosting CVPR 2024 Long-form Video Understanding Challenge @ LOVEU.
- 04/2024: Invited talk at AgentX seminar about our STEVE series works.
- 03/2024: One paper accepted to ICLR 2024 workshop @ LLM Agents.
- 02/2024: Two papers accepted to CVPR 2024, with one selected as a highlight (2.81%).
- 02/2024: Invited talk at AAAI 2024 workshop @ IMAGEOMICS.
- 12/2023: One paper accepted to AAAI 2024.
- 09/2023: One paper accepted to ICCV 2023 workshop @ TNGCV-DataComp.
- 07/2023: Two papers accepted to ICCV 2023.
Recent
Projects
* Equal contribution. † Project lead. ‡ Corresponding author.
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai*†, Enxin Song*, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning
arXiv preprint, 2024
[Project Page]
[Paper]
[Video]
[Model]
[Benchmark]
[Leaderboard]
[Dataset]
[Code]
AuroraCap is a multimodal LLM designed for detailed image and video captioning. We also release VDC, the first benchmark for detailed video captioning.
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang‡
arXiv preprint, 2024
[Project Page]
[Paper]
[Video]
[Raw Result]
[Code]
SAMURAI is a zero-shot visual tracking framework that adapts the Segment Anything Model (SAM) with motion-aware memory.
StableVideo: Text-driven Consistency-aware Diffusion Video Editing
Wenhao Chai, Xun Guo‡, Gaoang Wang, Yan Lu
International Conference on Computer Vision (ICCV), 2023
[Project Page]
[Paper]
[Video]
[Demo]
[Code]
We introduce temporal dependency into existing text-driven diffusion models, which allows them to generate consistent appearance for the new objects.
MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
Enxin Song*, Wenhao Chai*†, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang‡
Computer Vision and Pattern Recognition (CVPR), 2024
[Project Page]
[Paper]
[Blog]
[Video]
[Dataset]
[Leaderboard]
[Code]
MovieChat achieves state-of-the-art performance in extra-long video (more than 10K frames) understanding by introducing a memory mechanism.
Join Us
Welcome Collaboration
We are looking for potential collaborators on video understanding for the Aurora series, generative models for video, large language and multimodal models, and embodied agents. If you are interested in our research, please feel free to contact us.
- Email address: wchai@uw.edu