Our Research

Wenhao Chai

Our research focuses primarily on embodied agents, video understanding, generative models, and human pose and motion.

Generative Models

Artificial Intelligence Generated Content (AIGC) has become a hot topic in recent years. We have been working on various generative AI tasks, including image editing [Diffashion], video editing [StableVideo], 3D object generation [Consist3D], and city layout generation [CityGen].

Human Pose and Motion

Human pose estimation is an essential computer vision task that aims to estimate the coordinates of body joints from single-frame images or videos. Our research focuses on novel architectures [ZeDO], domain adaptation [PoseDA] [PoSynDA], and large-scale pre-training [MPM] [UniHPE]. We are also interested in animal [UniAP] and infant [ZeDO-i] pose estimation.

Embodied Agent

Large language models (LLMs) have achieved impressive progress on several open-world tasks, and using LLMs to build embodied agents has recently become an active research area. We propose [STEVE] [STEVE-1.5] [STEVE-2], a series of comprehensive embodied agents in the Minecraft virtual environment.

Video Understanding

Integrating video foundation models with large language models to build a video understanding system can overcome the limitations of specific, pre-defined vision tasks. We propose [MovieChat] to support ultra-long video understanding, together with a newly collected benchmark.

Video Demos