Our Research

Wenhao Chai

My current research focuses on developing embodied intelligence, inspired by cognitive science principles, that can interact with the physical world, with video understanding as its core perceptual mechanism. I propose a long-short term memory framework modeled after the human memory system, which enables pre-trained video LMMs to comprehend multi-hour video content without additional fine-tuning. To improve efficiency, I introduce token merging, which significantly reduces the number of visual tokens with minimal performance degradation. I also demonstrate the step-by-step development of an agent system in Minecraft, showcasing cognitive-inspired agent capabilities in virtual environments.

I have just begun shaping my research narrative around embodied intelligence and cognitive science. With "pretraining as we know it will end", I believe the future of artificial intelligence lies in aligning with human intelligence—and ultimately surpassing it.



The following timeline covers my past research experience, organized by field. The template is from here.

Large Multi-Modal Models (06/2023 - present)

How to efficiently build and evaluate large multi-modal models?

How to involve large multi-modal models in embodied agent systems?

Generative Models (03/2023 - 03/2024)

How to generate high-quality images, videos and 3D worlds?

How to control and evaluate the generated content?

Human Pose and Motion (08/2022 - 08/2023)

How to estimate human pose and motion from images and videos?

How to generate realistic and controllable human motion?
