Our Research
Wenhao Chai
My current research focuses on developing embodied AI agents that interact with the physical world, with a particular emphasis on leveraging video understanding as a core perception tool. A key challenge in video understanding with Large Multimodal Models (LMMs) is efficiency during both training and inference. My work addresses this through token merging, which significantly reduces the number of visual tokens with minimal performance drop. My work has also shown that introducing a long-short-term memory mechanism lets pre-trained video LMMs understand videos spanning several hours without further fine-tuning, and has demonstrated how to build agent systems in Minecraft step by step.
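To make the token-merging idea concrete, here is a minimal sketch of similarity-based merging: the most redundant adjacent token pairs are averaged until the sequence is shorter. This is an illustrative simplification, not the actual implementation used in my work (methods such as ToMe use bipartite soft matching over attention keys).

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, r: int) -> np.ndarray:
    """Reduce an (n, d) array of token embeddings by r tokens.

    Illustrative only: repeatedly averages the most cosine-similar
    adjacent pair, so nearly duplicate visual tokens collapse first.
    """
    tokens = tokens.astype(float).copy()
    for _ in range(r):
        a, b = tokens[:-1], tokens[1:]
        # cosine similarity between each token and its right neighbor
        sims = (a * b).sum(1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
        )
        i = int(np.argmax(sims))            # most redundant adjacent pair
        merged = (tokens[i] + tokens[i + 1]) / 2
        tokens = np.vstack([tokens[:i], merged[None], tokens[i + 2:]])
    return tokens

# Example: 6 tokens of dimension 4, reduced to 4
x = np.random.default_rng(0).normal(size=(6, 4))
y = merge_tokens(x, r=2)
print(y.shape)  # (4, 4)
```

In a video LMM, dropping even half of the visual tokens this way cuts attention cost substantially, which is why the performance drop can stay small when the merged tokens were nearly redundant.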
Aurora Series:
Multimodal Understanding
We have recently been integrating video foundation models and large language models to build multimodal understanding systems, starting with AuroraCap, a state-of-the-art image and video captioning model. We also propose [MovieChat] to support ultra-long video understanding tasks, together with a newly collected benchmark.
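The long-short-term memory mechanism behind ultra-long video understanding can be sketched as a bounded short-term buffer of frame features that, when full, merges its most similar adjacent pair and keeps a compact trace in long-term memory. The class below is a hypothetical simplification for illustration, not the MovieChat implementation.

```python
import numpy as np

class LongShortTermMemory:
    """Toy sketch of a long/short-term memory for streaming video frames.

    Assumption (illustrative): frame features enter a bounded short-term
    buffer; on overflow, the two most cosine-similar adjacent features
    are merged, and the merged feature is also appended to long-term memory.
    """

    def __init__(self, short_capacity: int = 8):
        self.short = []          # recent, high-resolution frame features
        self.long = []           # compact consolidated features
        self.cap = short_capacity

    def add(self, feat) -> None:
        self.short.append(np.asarray(feat, dtype=float))
        if len(self.short) > self.cap:
            self._consolidate()

    def _consolidate(self) -> None:
        # merge the most similar adjacent pair in short-term memory
        sims = [
            float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            for a, b in zip(self.short, self.short[1:])
        ]
        i = int(np.argmax(sims))
        merged = (self.short[i] + self.short[i + 1]) / 2
        self.short[i:i + 2] = [merged]
        self.long.append(merged)

# Stream 10 frame features through a capacity-8 buffer
m = LongShortTermMemory(short_capacity=8)
for t in range(10):
    m.add(np.random.default_rng(t).normal(size=4))
print(len(m.short), len(m.long))  # 8 2
```

Because memory size stays bounded no matter how many frames arrive, a pre-trained video LMM can be applied to hours-long videos without growing its context, which is the key property the mechanism exploits.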
STEVE Series:
Embodied Agent
Large language models (LLMs) have achieved impressive progress on several open-world tasks, and building embodied agents with LLMs has recently become a research hotspot. We propose [STEVE], [STEVE-1.5], and [STEVE-2], a comprehensive and visionary embodied agent series in the Minecraft virtual environment.
Generative Models
Artificial Intelligence Generated Content (AIGC) has become a hot topic. We have been working on various tasks related to generative AI, including image editing [Diffashion], video editing [StableVideo], 3D object generation [Consist3D], and city layout generation [CityGen].
Human Pose and Motion
Human pose estimation is an essential computer vision task that aims to estimate the coordinates of body joints from single-frame images or videos. Our research focuses on novel architectures [ZeDO], domain adaptation [PoseDA] [PoSynDA], and large-scale pre-training [MPM] [UniHPE]. We are also interested in animal [UniAP] and infant [ZeDO-i] pose estimation.