Research
Wenhao Chai
My current research focuses on developing embodied intelligence, inspired by cognitive-science principles, that interacts with the physical world, building on video understanding as a core perceptual mechanism. I proposed a long-short-term memory framework modeled after the human memory system, enabling pre-trained video LMMs to comprehend multi-hour video content without additional fine-tuning. To improve efficiency, I introduced token merging, which significantly reduces the number of visual tokens with minimal performance degradation. I also demonstrated step-by-step construction of agent systems in Minecraft, showcasing cognition-inspired agent capabilities in virtual environments.
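The token-merging idea above can be sketched roughly as follows: a short-term buffer of frame tokens is consolidated whenever it exceeds its capacity, by repeatedly averaging the most similar adjacent pair into a single token. This is an illustrative sketch only; the function names, the use of cosine similarity, and averaging as the merge operator are assumptions, not the exact implementation from my papers.

```python
import numpy as np

def merge_most_similar(tokens: np.ndarray) -> np.ndarray:
    """Merge the most similar adjacent pair of tokens by averaging them.

    tokens: array of shape (n_tokens, dim).
    """
    # Cosine similarity between each adjacent pair of tokens.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sims = (normed[:-1] * normed[1:]).sum(axis=1)
    i = int(np.argmax(sims))
    # Replace the most similar pair with their mean.
    merged = (tokens[i] + tokens[i + 1]) / 2.0
    return np.concatenate([tokens[:i], merged[None, :], tokens[i + 2:]])

def consolidate(short_term: np.ndarray, capacity: int) -> np.ndarray:
    """Greedily merge adjacent tokens until the buffer fits its capacity."""
    while len(short_term) > capacity:
        short_term = merge_most_similar(short_term)
    return short_term
```

Each merge removes one token, so consolidating a buffer of n frame tokens down to a capacity of k costs n − k merge steps, which keeps the memory footprint bounded regardless of video length.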
I have just begun shaping my research narrative around embodied intelligence and cognitive science. Given that "pretraining as we know it will end", I believe the future of artificial intelligence lies in aligning with human intelligence and, ultimately, surpassing it.
The following is a timeline of my past research experience and fields. The template is from here.
How to efficiently build and evaluate large multi-modal models?
How to involve large multi-modal models in embodied agent systems?
- Video Understanding
- Long Video with Long-Short Term Memory [MovieChat]
- Long Video with Language Guidance Memory [MovieChat+]
- Video Detailed Captioning [AuroraCap]
- Embodied Agent
- Build in Minecraft [STEVE-1]
- Multi-Agent [STEVE-1.5]
- Knowledge Distillation [STEVE-2]
How to generate high-quality images, videos and 3D worlds?
How to control and evaluate the generated content?
- Video
- Image
- 3D World
How to estimate human pose and motion from images and videos?
How to generate realistic and controllable human motion?
- 3D Human Pose Estimation
- Motion Representation
- Tracking
Featured
Videos
AuroraCap [Project Page]
SAMURAI [Project Page]
STEVE [Project Page]
StableVideo [Hugging Face Demo]
MovieChat [Project Page]
Organized
Workshops | Tutorials | Talks
5th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot
CVPR 2025, Nashville, TN
Workshop Organizer
4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot
CVPR 2024, Seattle, WA
Workshop Organizer (Track 1)
First Workshop on Imageomics: Discovering Biological Knowledge from Images using AI
AAAI 2024, Vancouver, Canada
Invited Talk
Featured
Posters
† Attended in person.
Ego3DT: Tracking Every 3D Object in Ego-centric Videos
ACM MM 2024, Melbourne, Australia
† MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
CVPR 2024, Seattle, WA
† STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft
CVPR 2024 workshop, Seattle, WA
† UniAP: Towards Universal Animal Perception in Vision via Few-shot Learning
AAAI 2024, Vancouver, Canada
† StableVideo: Text-driven Consistency-aware Diffusion Video Editing
ICCV 2023, Paris, France
See and Think: Embodied Agent in Virtual Environment
ECCV 2024, Milan, Italy
† Learning Diffusion Texture Priors for Image Restoration
CVPR 2024, Seattle, WA
Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation
ICLR 2024 workshop, Vienna, Austria
† Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation
WACV 2024, Waikoloa, Hawaii
† Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation
ICCV 2023, Paris, France