Our Research
Wenhao Chai
Our research focuses primarily on embodied agents, video understanding, generative models, and human pose and motion.
Generative Models
Artificial Intelligence Generated Content (AIGC) has been a hot topic recently. We have been working on various tasks related to generative AI, including image editing [Diffashion], video editing [StableVideo], 3D object generation [Consist3D], and city layout generation [CityGen].
Human Pose and Motion
Human pose estimation is an essential computer vision task that aims to estimate the coordinates of joints from single-frame images or videos. Our research focuses on novel architectures [ZeDO], domain adaptation [PoseDA] [PoSynDA], and large-scale pre-training [MPM] [UniHPE]. We are also interested in animal [UniAP] and infant [ZeDO-i] pose estimation.
Embodied Agent
Large language models (LLMs) have achieved impressive progress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. We propose [STEVE] [STEVE-1.5] [STEVE-2], a comprehensive and visionary embodied agent series in the Minecraft virtual environment.
Video Understanding
Recently, integrating video foundation models with large language models to build video understanding systems has made it possible to move beyond the limitations of specific pre-defined vision tasks. We propose [MovieChat] to support ultra-long video understanding tasks, together with a newly collected benchmark.
Video Demos
Media Coverage | Blog
- Vision Model + Large Language Model: The First Framework Supporting Long Video Understanding Tasks with 10K+ Frames, by PaperWeekly
- StableVideo: Text-driven Consistency-aware Diffusion Video Editing, by @_akhaliq