Research
Wenhao Chai
My current research focuses on developing visual intelligence to understand the physical world, building on video understanding as a core perceptual mechanism. I propose a long-short-term memory framework modeled after the human memory system, enabling pre-trained video Large Multi-modal Models (LMMs) to comprehend hour-long video content without additional fine-tuning. To improve efficiency, I introduce token merging into LMMs, significantly reducing the number of visual tokens with minimal performance degradation. I also demonstrate step-by-step agent system development in Minecraft, showcasing cognition-inspired agent capabilities in virtual environments.
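To make the memory idea concrete, here is a minimal sketch of consolidating a short-term frame buffer into a fixed-size long-term memory by repeatedly averaging the most similar adjacent frame features. This is an illustrative assumption of how such a framework can bound memory cost, not the actual MovieChat implementation; the function name, capacity, and feature dimension are hypothetical.

```python
import torch
import torch.nn.functional as F

def consolidate_memory(frames: torch.Tensor, capacity: int) -> torch.Tensor:
    """Compress a (num_frames, dim) buffer of frame features down to
    `capacity` slots by repeatedly averaging the most similar adjacent pair.
    Illustrative sketch only, not the actual MovieChat code."""
    while frames.shape[0] > capacity:
        # Cosine similarity of each frame feature with its temporal neighbor.
        sims = F.cosine_similarity(frames[:-1], frames[1:], dim=-1)
        i = int(sims.argmax())
        # Merge the most redundant adjacent pair into a single slot.
        merged = 0.5 * (frames[i] + frames[i + 1])
        frames = torch.cat([frames[:i], merged.unsqueeze(0), frames[i + 2:]], dim=0)
    return frames

# Hypothetical usage: 256 frame features compressed into 16 long-term memory slots.
long_term = consolidate_memory(torch.randn(256, 768), capacity=16)
print(long_term.shape)  # torch.Size([16, 768])
```

Because the long-term buffer stays at a fixed size regardless of video length, the visual tokens fed to a pre-trained LMM remain bounded, which is what makes hour-long inputs tractable without fine-tuning.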
I have just begun shaping my research narrative around visual intelligence. If, as has been predicted, "pretraining as we know it will end", then I believe the future of artificial intelligence lies in aligning with human intelligence and ultimately surpassing it. I firmly believe computer vision will continue to advance toward that goal.
The following presents my comprehensive research experience and areas of focus, along with a timeline highlighting the periods when I was most actively engaged in each field. The template is from here.
How to efficiently build and evaluate large multi-modal models?
How to incorporate large multi-modal models into embodied agent systems?
- Video Understanding
  - Long Video with Long-Short Term Memory: MovieChat
  - Long Video with Language-Guided Memory: MovieChat+
  - Video Detailed Captioning: AuroraCap
How to generate high-quality images, videos and 3D worlds?
How to control and evaluate the generated content?
- Video
- Image
- 3D World
How to estimate human pose and motion from images and videos?
How to generate realistic and controllable human motion?
- 3D Human Pose Estimation
- Motion Representation
- Tracking
Featured
Videos

AuroraCap [Project Page]

SAMURAI [Project Page]

STEVE [Project Page]

StableVideo [Hugging Face Demo]

MovieChat [Project Page]
Organized
Workshops | Tutorials | Talks

5th International Workshop on Long-form Video Understanding
CVPR 2025, Nashville, TN
Workshop Organizer

4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot
CVPR 2024, Seattle, WA
Workshop Organizer (Track 1)

First Workshop on Imageomics: Discovering Biological Knowledge from Images using AI
AAAI 2024, Vancouver, Canada
Invited Talk