Wenhao Chai

CS Ph.D. Student @ Princeton University

About

Wenhao Chai

Wenhao Chai is an incoming Ph.D. student in Computer Science at Princeton University, working with Prof. Zhuang Liu. He received his master's degree from the University of Washington in 2025 and his bachelor's degree from Zhejiang University in 2023. He was previously a research intern at Stanford University in the summer of 2024, working with Prof. Christopher D. Manning, and a visiting scholar at the University of Illinois Urbana-Champaign in the spring and summer of 2022. He has interned at Pika Labs and Microsoft Research Asia. His research spans a wide range of topics in computer vision and deep learning, and his previous work primarily covers video understanding and generation. He leads the development of MovieChat, the first Large Multi-Modal Model for hour-long video understanding. He has co-organized workshops and challenges on video understanding at CVPR 2024 and 2025.

Check Out

News and Highlights

FAQ. To junior master's and undergraduate students: if you would like to chat about life, career plans, or research ideas related to AI/ML, I will dedicate at least 30 minutes every week to such meetings. I encourage students from underrepresented groups to reach out.
Internship. We are looking for interns and visiting students at Princeton University. Please refer to the link for more details. We welcome passionate individuals to join our research community. Feel free to reach out if you have any questions!
Join Discord. We host a Discord server for professors and students for daily sharing and research discussion.
Calendar. View my availability and upcoming events.

I am putting together my schedule for CVPR 2025, held June 11th to June 15th, 2025, at the Music City Center in Nashville, TN. Message me if you'd like to join a road trip, have a coffee chat, or arrange a meal together.

09/2025: I join Princeton University as a CS Ph.D. student, working with Prof. Zhuang Liu. 2025 Fall application Record.
04/2025: We host the CVPR 2025 Video Understanding Challenge @ LOVEU.
04/2025: One paper accepted to CVPR 2025 workshop @ Urban Scene Modeling.
03/2025: One paper accepted to CVPR 2025 workshop @ Efficient Large Vision Models.
03/2025: I graduate from the University of Washington with the thesis "Large Multi-Modal Models for Video Captioning".
02/2025: Three papers accepted to CVPR 2025.
01/2025: Two papers accepted to ICLR 2025.
12/2024: Two papers accepted to AAAI 2025.
07/2024: Two papers accepted to ECCV 2024.
06/2024: One technical report accepted to CVPR 2024 workshop @ NTIRE.
06/2024: I work with Pika Labs as an intern to develop next-generation video understanding and generation models.
05/2024: One paper accepted to CVPR 2024 workshop @ Embodied AI.
04/2024: We host the CVPR 2024 Long-form Video Understanding Challenge @ LOVEU.
04/2024: Invited talk at the AgentX seminar on our STEVE series of works.
03/2024: One paper accepted to ICLR 2024 workshop @ LLM Agents.
02/2024: Two papers accepted to CVPR 2024 with one highlight (2.81%).
02/2024: Invited talk at AAAI 2024 workshop @ IMAGEOMICS.
12/2023: One paper accepted to AAAI 2024.
09/2023: One paper accepted to ICCV 2023 workshop @ TNGCV-DataComp.
07/2023: Two papers accepted to ICCV 2023.

Recent

Projects

* Equal contribution. ^† Project lead. ^‡ Corresponding author.

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai^†, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning
International Conference on Learning Representations (ICLR), 2025
Project Page | Paper | Video | Model | Benchmark | Leaderboard | Poster | Code

AuroraCap is a multimodal LLM designed for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang^‡
arXiv preprint, 2024
Project Page | Paper | Video | Raw Result | Code

SAMURAI is a zero-shot visual tracking framework that adapts the Segment Anything Model (SAM) with motion-aware memory.

MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
Enxin Song*, Wenhao Chai*^†, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang^‡
Computer Vision and Pattern Recognition (CVPR), 2024
Project Page | Paper | Blog | Video | Dataset | Leaderboard | Code

MovieChat achieves state-of-the-art performance in extra-long video understanding (more than 10K frames) by introducing a memory mechanism.

StableVideo: Text-driven Consistency-aware Diffusion Video Editing
Wenhao Chai, Xun Guo^‡, Gaoang Wang, Yan Lu
International Conference on Computer Vision (ICCV), 2023
Project Page | Paper | Video | Demo | Code

We introduce temporal dependency to existing text-driven diffusion models, which allows them to generate a consistent appearance for the new objects.

Explore

More Pages

Mentoring

Calendar

PhD Application Record

CVPR 2025

Junior FAQ
