Our Research

Research

Wenhao Chai

My current research focuses on developing embodied intelligence, inspired by principles from cognitive science, that can interact with the physical world, with video understanding as its core perceptual mechanism. I propose a long-short term memory framework modeled on the human memory system, which enables pre-trained video LMMs to comprehend multi-hour video content without additional fine-tuning. To improve efficiency, I introduce token merging, which significantly reduces the number of visual tokens with minimal performance degradation. I also demonstrate the step-by-step development of an agent system in Minecraft, showcasing cognitively inspired agent capabilities in virtual environments.

I have just begun shaping my research narrative around embodied intelligence and cognitive science. Given that "pretraining as we know it will end", I believe the future of artificial intelligence lies in aligning with human intelligence and ultimately surpassing it.



The following summarizes my research experience and areas of focus, with a timeline of the periods when I was most active in each field. The template is from here.

Large Multi-Modal Models (06/2023 - present)

How to efficiently build and evaluate large multi-modal models?

How to involve large multi-modal models in embodied agent systems?


  • Video Understanding
  • Embodied Agent
Generative Models (03/2023 - 03/2024)

How to generate high-quality images, videos and 3D worlds?

How to control and evaluate the generated content?


Human Pose and Motion (08/2022 - 08/2023)

How to estimate human pose and motion from images and videos?

How to generate realistic and controllable human motion?


  • 3D Human Pose Estimation
    • + Domain Adaptation PoseDA
    • + Domain Adaptation and Diffusion PoSynDA
    • + Diffusion ZeDO
    • + Domain Adaptation and Diffusion ZeDO-i
    • Animal Pose Estimation UniAP
    • Use 4D Radar Tensor RT-Pose
  • Motion Representation
    • Mask Modeling MPM
    • Contrastive Learning UniHPE
  • Tracking



Organized

Workshops | Tutorials | Talks