Our Research


Wenhao Chai

Our research focuses primarily on embodied agents, video understanding, generative models, and human pose and motion.

Generative Models

Artificial Intelligence Generated Content (AIGC) has been a hot topic these days. We have been working on various tasks related to generative AI, including image editing [Diffashion], video editing [StableVideo], 3D object generation [Consist3D], and city layout generation [CityGen].

Human Pose and Motion

Human pose estimation is an essential computer vision task that aims to estimate the coordinates of body joints from single-frame images or videos. Our research focuses on novel architectures [ZeDO], domain adaptation [PoseDA] [PoSynDA], and large-scale pre-training [MPM] [UniHPE]. We are also interested in animal [UniAP] and infant [ZeDO-i] pose estimation.

Embodied Agent

Large language models (LLMs) have achieved impressive progress on several open-world tasks, and using LLMs to build embodied agents has recently become an active research area. We propose [STEVE] [STEVE-1.5] [STEVE-2], a comprehensive and visionary series of embodied agents in the Minecraft virtual environment.

Video Understanding

Integrating video foundation models with large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. We propose [MovieChat] to support ultra-long video understanding tasks, together with a newly collected benchmark.


Video Demos