Wenhao Chai

Graduate Student @
University of Washington

About

Wenhao Chai

Docs: [CV] [Research Statement]
Research: [Google Scholar] [GitHub]
Social Media: [Twitter] [Instagram] [LinkedIn] [Hugging Face]


Wenhao Chai is currently a graduate student at the University of Washington, in the Information Processing Lab advised by Prof. Jenq-Neng Hwang. Previously, he was an undergraduate student at Zhejiang University, in the CVNext Lab advised by Prof. Gaoang Wang. He is fortunate to work with Prof. Christopher D. Manning at Stanford University, and has worked with Prof. Saining Xie and Prof. Yilun Du. He has interned at Pika Lab and Microsoft Research Asia. His research focuses primarily on large multimodal models (LMMs) for video understanding, embodied agents, and generative models. He has published related papers in top-tier conferences and journals such as CVPR, ICCV, ECCV, and AAAI.

Aurora Series

09/2024: We release the first work in the Aurora series, AuroraCap, in collaboration with Pika Lab, Stanford, MIT, Harvard, and NYU, which aims to build a new generation of image and video captioning models.

View more

MovieChat

07/2023: We release MovieChat, the first framework that can chat over more than ten thousand frames of video, accepted to CVPR 2024. We also host the LOVEU: LOng-form VidEo Understanding challenge at CVPR 2024!

View more

STEVE Series

12/2023: We release the STEVE series, named after the protagonist of the game Minecraft, which aims to build embodied agents based on MLLMs. This series of works has been accepted to ECCV 2024, as well as ICLR and CVPR workshops.

View more

StableVideo

07/2023: We release StableVideo, a diffusion-based framework for text-driven video editing, accepted to ICCV 2023. The project repo has gained over 1.4k stars on GitHub.

View more

Check Out

News and Highlights

  • 10/2024: We have started writing blogs. Check them out here.
  • 07/2024: Two papers accepted to ACM MM 2024.
  • 07/2024: Two papers accepted to ECCV 2024.
  • 06/2024: One technical report accepted to the CVPR 2024 workshop @ NTIRE.
  • 06/2024: We are working with Pika Lab to develop next-generation video understanding and generation models.
  • 05/2024: One paper accepted to CVPR 2024 workshop @ Embodied AI.
  • 04/2024: We are hosting the CVPR 2024 Long-form Video Understanding Challenge @ LOVEU.
  • 04/2024: Invited talk at AgentX seminar about our STEVE series works.
  • 03/2024: One paper accepted to ICLR 2024 workshop @ LLM Agents.
  • 02/2024: Two papers accepted to CVPR 2024 (1 highlight).
  • 02/2024: Invited talk at AAAI 2024 workshop @ IMAGEOMICS.
  • 12/2023: One paper accepted to ICASSP 2024.
  • 12/2023: One paper accepted to AAAI 2024.
  • 11/2023: Two papers accepted to WACV 2024 workshop @ CV4Smalls.
  • 09/2023: One paper accepted to ICCV 2023 workshop @ TNGCV-DataComp.
  • 09/2023: One paper accepted to IEEE T-MM.
  • 08/2023: One paper accepted to BMVC 2023.
  • 07/2023: Two papers accepted to ACM MM 2023.
  • 07/2023: Two papers accepted to ICCV 2023.

View more

Join Us

Welcome Collaboration

We are looking for potential collaborators in video understanding (Aurora Series). If you are interested in our research, please feel free to contact us.



  • Email address: wchai@uw.edu

Recent

Projects

* Equal contribution. Project lead. Corresponding author.


AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai*, Enxin Song*, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, Christopher D. Manning
arXiv preprint, 2024
[Website] [Paper] [Model] [Benchmark] [Dataset] [Code]

AuroraCap is a multimodal LLM designed for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.

StableVideo: Text-driven Consistency-aware Diffusion Video Editing
Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu
International Conference on Computer Vision (ICCV), 2023
[Website] [Paper] [Video] [Demo] [Code]

We introduce temporal dependency to existing text-driven diffusion models, which allows them to generate a consistent appearance for the edited objects.

MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
Computer Vision and Pattern Recognition (CVPR), 2024
[Website] [Paper] [Blog] [Dataset] [Code]

MovieChat achieves state-of-the-art performance in extra-long video (more than 10K frames) understanding by introducing a memory mechanism.

See and Think: Embodied Agent in Virtual Environment
Zhonghan Zhao*, Wenhao Chai*, Xuan Wang*, Boyi Li, Shengyu Hao, Shidong Cao, Tian Ye, Jenq-Neng Hwang, Gaoang Wang
European Conference on Computer Vision (ECCV), 2024
[Website] [Paper] [Video] [Dataset] [Model]

STEVE, named after the protagonist of the game Minecraft, is a framework that aims to build an embodied agent based on vision models and LLMs within an open world.

Explore

More