Generative Models

We have been working on various tasks related to generative AI, including the creation and editing of images, videos, 3D objects, and 3D cities.




  • StableVideo: Text-driven Consistency-aware Diffusion Video Editing
    Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu
    International Conference on Computer Vision (ICCV), 2023
    [Website] [Paper] [Video] [Demo] [Code]

  • Learning Diffusion Texture Priors for Image Restoration
    Tian Ye, Sixiang Chen, Wenhao Chai, Zhaohu Xing, Jing Qin, Ge Lin, Lei Zhu
    Computer Vision and Pattern Recognition (CVPR), 2024 (Highlight)
    [Paper]

  • DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models
    Shidong Cao*, Wenhao Chai*, Shengyu Hao, Yanting Zhang, Hangyue Chen, Gaoang Wang
    IEEE Transactions on Multimedia
    [Paper] [Code]

  • CityGen: Infinite and Controllable 3D City Layout Generation
    Jie Deng*, Wenhao Chai*, Jianshu Guo*, Qixuan Huang, Wenhao Hu, Jenq-Neng Hwang, Gaoang Wang
    arXiv Preprint.
    [Website] [Paper] [Code]

  • VersaT2I: Improving Text-to-Image Models with Versatile Reward
    Jianshu Guo*, Wenhao Chai*, Jie Deng*, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang
    arXiv Preprint.
    [Paper]

Video Understanding

We integrate video foundation models and large language models to build a video understanding system that can overcome the limitations of specific pre-defined vision tasks.




  • MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
    Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
    Computer Vision and Pattern Recognition (CVPR), 2024
    [Website] [Paper] [Video] [Dataset] [Code]

  • MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
    Enxin Song*, Wenhao Chai*, Tian Ye, Jenq-Neng Hwang, Xi Li, Gaoang Wang
    arXiv Preprint.
    [Website] [Paper] [Video] [Dataset] [Code]

Embodied Agent

Large language models (LLMs) have achieved impressive progress on several open-world tasks. Recently, using LLMs to build embodied agents has become an active research topic.




  • [STEVE-1] See and Think: Embodied Agent in Virtual Environment
    Zhonghan Zhao*, Wenhao Chai*, Xuan Wang*, Boyi Li, Shengyu Hao, Shidong Cao, Tian Ye, Jenq-Neng Hwang, Gaoang Wang
    arXiv Preprint.
    [Website] [Paper] [Dataset] [Code]

  • [STEVE-1.5] Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation
    Zhonghan Zhao*, Kewei Chen*, Dongxu Guo*, Wenhao Chai, Tian Ye, Yanting Zhang, Gaoang Wang
    International Conference on Learning Representations Workshop (ICLRW), 2024
    [Paper]

  • [STEVE-2] Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model
    Zhonghan Zhao, Ke Ma, Wenhao Chai, Xuan Wang, Kewei Chen, Dongxu Guo, Yanting Zhang, Hongwei Wang, Gaoang Wang
    arXiv Preprint.
    [Paper]

Human Pose and Motion

Human pose estimation is an essential computer vision task that aims to estimate the coordinates of joints from single-frame images or videos. Our research focuses on novel architectures, domain adaptation, and large-scale pre-training. We are also interested in animal and infant pose estimation.




  • Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation
    Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang, Gaoang Wang
    International Conference on Computer Vision (ICCV), 2023
    [Paper] [Code]

  • UniAP: Towards Universal Animal Perception in Vision via Few-shot Learning
    Meiqi Sun*, Zhonghan Zhao*, Wenhao Chai*, Hanjun Luo, Shidong Cao, Yanting Zhang, Jenq-Neng Hwang, Gaoang Wang
    AAAI Conference on Artificial Intelligence (AAAI), 2024
    [Website] [Paper] [Code]

  • PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Enhanced 3D Human Pose Estimation
    Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, Xuansong Xie
    ACM Multimedia (ACM MM), 2023
    [Paper] [Code]

  • Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation
    Zhongyu Jiang, Zhuoran Zhou, Lei Li, Wenhao Chai, Cheng-Yen Yang, Jenq-Neng Hwang
    Winter Conference on Applications of Computer Vision (WACV), 2024
    [Website] [Paper] [Code]

  • Exploring Learning-based Motion Models in Multi-Object Tracking
    Hsiang-Wei Huang, Cheng-Yen Yang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang
    arXiv Preprint.
    [Paper]