Ailing Zeng 曾爱玲

Biography

I'm a technical staff member at Anuttacon, leading the development of a human-centric interactive multimodal video generation system. The system's models enable agents to perceive, interact with, and generate real-time, long-horizon video behaviors. Previously, I spent three wonderful years at Tencent Hunyuan & AI Lab and the International Digital Economy Academy (IDEA), leading a human-centric perception and generation research team. I obtained my Ph.D. from the Department of Computer Science and Engineering, the Chinese University of Hong Kong, supervised by Prof. Qiang Xu. I was a visiting scholar at the Robotics Institute, Carnegie Mellon University.

Selected previous research:

1) Human-centric visual perception with large-scale data and generic models: IDOL, AiOS, SMPLer-X, OSX, DW-Pose, ED-Pose, SmoothNet, DeciWatch

2) Large-scale multi-modality datasets: Motion-X, UBody, Uni-KPT, BallPlay, HuMMan, Human-Art

3) Human-centric generative models: MotionCraft, HumanSD, PhysHOI, DreamWaltz, HumanTOMATO, DiffSHEG

4) Interactive AI & Human-in-the-loop techniques: X-Pose, Click-Pose, Grounded-SAM

5) Previously, time series analysis and forecasting: LTSF-Linear, SCINet, FITS

We are hiring full-time researchers, engineers, and interns based in Mountain View or Singapore; see our open roles. Feel free to reach out if you are interested.

News

[2026.04] We introduce LPM 1.0, a video-based character performance model that generates real-time video with full-duplex conversation, identity-consistent infinite-length generation, and nuanced human-like performance: [Project Page] [Paper].
[2026.01] We are hosting the ICLR 2026 Tutorial on AI with Recursive Self-Improvement; paper submissions are welcome: [Link].
[2025] 5 papers were accepted to ICCV/CVPR/AAAI/ICML 2025, 4 papers were accepted to TPAMI.
[2025.10] Invited talk at the "1st Workshop on Interactive Human-centric Foundation Models", ICCV 2025.
[2025.10] We are hosting the SIGGRAPH Asia 2025 Workshop on "Towards Embodied Intelligence Across Humans, Avatars, and Humanoid Robotics".
[2025.03] Please check out and register for our AI-native game project Whispers from the Star. Stay tuned!
[2025.02] Invited talk on video generation at the Max Planck Institute for Intelligent Systems, hosted by Michael J. Black.
[2024] 13 papers were accepted to CVPR/ICML/ICLR/ECCV/SIGGRAPH-Asia/NeurIPS 2024.
[2024.10] I serve as an Area Chair for CVPR 2025.
[2024.05] LTSF-Linear was selected as the Most Influential Paper in AAAI 2023!
[2024.04] We are hosting the ECCV 2024 Tutorial on “Recent Advances in Video Content Understanding and Generation” (VENUE).
[2023.12] We are hosting the CVPR 2024 Workshop on “Computer Vision with Humans in the Loop” (CVHL).
[2023] 13 papers were accepted to CVPR/ICLR/NeurIPS/ICCV/AAAI 2023.

Selected Research

See full list at Google Scholar. (*equal contribution, ^#corresponding author or project lead)

LPM 1.0: Video-based Character Performance Model [Project Page]
Ailing Zeng^#, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu, Gavin Lin, Gilbert Gu, Jeremy Pi, Leo Li, Mingyi Shi, Sheng Bi, Steven Tang, Thorn Hang, Tobey Guo, Vincent Li, Xin Tong^#, Yikang Li, Yuchen Sun, Yue (R) Zhao, Yuhan Lu, Yuwei Li, Zane Zhang, Zeshi Yang, Zi Ye
technical report, 2026

LPM 1.0 generates real-time video with full-duplex conversation, identity-consistent infinite-length generation, and nuanced human-like performance.

The Dawn of Video Generation: Preliminary Explorations with SORA-like Models [Website]
Ailing Zeng^*, Yuhang Yang^*, Weidong Chen, Wei Liu
technical report, 2024

A systematic empirical study of 21 SORA-like text-to-video, image-to-video, and video-to-video models.

IDOL: Instant Photorealistic 3D Human Creation from a Single Image [Code] [Data]
Yiyu Zhuang*, Jiaxi Lv*, Hao Wen*, Qing Shuai, Ailing Zeng^#, Hao Zhu^#, Shifeng Chen, Yujiu Yang, Xun Cao, Wei Liu
CVPR, 2025

Rethink 3D human reconstruction from an image via a feed-forward ViT model trained on 100K multi-view subjects, making the model fast, photo-realistic, and generalizable.

SkillMimic: Learning Reusable Basketball Skills from Demonstrations [Code] [Data]
Yinhuai Wang, Qihan Zhao, Runyi Yu, Ailing Zeng^#, Jing Lin, Zhengyi Luo, Hok Wai Tsui, Jiwen Yu, Xiu Li, Qifeng Chen, Jian Zhang^#, Lei Zhang, Ping Tan
CVPR, 2025

It proposes a unified reward to learn diverse basketball skills for physically and dynamically plausible humanoid-ball motion simulation from the proposed human-ball motion datasets.

MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls [Code]
Yuxuan Bian, Ailing Zeng^#, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, Qiang Xu
AAAI, 2025

MotionCraft unifies (text2motion, speech2gesture, music2dance) whole-body motion generation within a plug-and-play multimodal DiT model.

DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion [Code]
Yukun Huang*, Jianan Wang*, Ailing Zeng, Zheng-Jun Zha, Lei Zhang, Xihui Liu
Extended Version of DreamWaltz [NeurIPS 2023]

It creates high-quality text-driven 3D avatars and expressive whole-body animation via the proposed Skeleton-guided Score Distillation and Hybrid 3D Gaussian Avatar Representation.

Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation [Code]
Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen
SIGGRAPH Asia, 2024

A keypoint-controllable, diffusion-based framework for expressive portrait image animation.

X-Pose: Detecting Any Keypoints [Code]
Jie Yang, Ailing Zeng^#, Ruimao Zhang^#, Lei Zhang
European Conference on Computer Vision (ECCV), 2024

X-Pose supports textual and visual prompts to detect arbitrary keypoints (e.g., on articulated, rigid, and soft objects).

HumanTOMATO: Text-aligned Whole-body Motion Generation [Code]
Shunlin Lu*, Ling-Hao Chen*, Ailing Zeng^#, Jing Lin, Ruimao Zhang^#, Lei Zhang, Heung-Yeung Shum^#
The 41st International Conference on Machine Learning (ICML), 2024

The first text-driven whole-body motion generation method with an explicit text-motion alignment.

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks [Code]
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, Lei Zhang
Technical report, 2024

Combining the segment anything model (SAM) with X, e.g., Grounding DINO or OSX for various visual tasks.

PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction [Homepage] [Code]
Yinhuai Wang, Jing Lin, Ailing Zeng^#, Zhengyi Luo, Jian Zhang^#, Lei Zhang
Technical report, 2023

A physics-based whole-body Human-Object Interaction (HOI) imitation approach for dynamic HOI, together with the proposed BallPlay dataset.

AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation [Homepage] [Code]
Qingping Sun*, Yanjun Wang*, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi Sing Leung, Ziwei Liu, Lei Yang, Zhongang Cai
Rank Top-1 on AGORA SMPL-X leaderboard! [link]

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

An end-to-end DETR-based model for multi-person whole-body mesh recovery.

FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions [Homepage] [Code] [Data]
Jiong Wang, Fengyu Yang, Wenbo Gou, Bingliang Li, Danqi Yan, Ailing Zeng, Yijun Gao, Junle Wang, Yanqing Jing, Ruimao Zhang
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

The first large-scale, real-world, multi-view dataset, comprising 11M frames from 8k sequences captured by 8 synchronized smartphones.

GPAvatar: Generalizable and Precise Head Avatar from Image(s) [Homepage] [Code]
Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, Tatsuya Harada
The Twelfth International Conference on Learning Representations (ICLR), 2024

A framework for 3D head avatar reconstruction from one or several images in a single forward pass.

Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset [Homepage] [Code]
Jing Lin*, Ailing Zeng*^#, Shunlin Lu*, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang
Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

A large-scale 3D expressive whole-body human motion dataset (over 10M frames) with SMPL-X, text, audio, and RGB modalities.

SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation [Homepage] [Code]
Zhongang Cai*, Wanqi Yin*, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, Ziwei Liu
Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

Rank Top-1 on 7 SMPL/SMPL-X benchmarks!

SMPL-X estimation with a systematic investigation on 32 datasets and model scaling-up.

DreamWaltz: Make a Scene with Complex 3D Animatable Avatars [Homepage] [Code]
Yukun Huang*, Jianan Wang*, Ailing Zeng, He Cao, Xianbiao Qi, Yukai Shi, Zheng-Jun Zha, Lei Zhang
Conference on Neural Information Processing Systems (NeurIPS), 2023

A text-driven 3D animatable avatar creation framework with complex scenes.

Effective Whole-body Pose Estimation with Two-stages Distillation [Code]
Zhendong Yang, Ailing Zeng, Chun Yuan, Yu Li
The Thirty-Fourth IEEE/CVF International Conference on Computer Vision (ICCV), CV4Metaverse Workshop, 2023

Rank Top-1 on the COCO-WholeBody benchmark. A better alternative to OpenPose.

HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation [Homepage] [Code]
Xuan Ju*, Ailing Zeng*, Chenchen Zhao*, Jianan Wang, Lei Zhang, Qiang Xu
(Oral) The Thirty-Fourth IEEE/CVF International Conference on Computer Vision (ICCV), 2023

HumanSD highlights multi-scenario human-centric image generation with precise pose control.

Neural Interactive Keypoint Detection [Code]
Jie Yang, Ailing Zeng^#, Feng Li, Shilong Liu, Ruimao Zhang^#, Lei Zhang
The Thirty-Fourth IEEE/CVF International Conference on Computer Vision (ICCV), 2023

The first end-to-end neural interactive keypoint detection/annotation framework, reducing labeling costs by more than 10×.

One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer [Homepage] [Code]
Rank Top-1 on AGORA SMPL-X leaderboard! (2023.04) [link]

Jing Lin, Ailing Zeng^#, Haoqian Wang, Lei Zhang, Yu Li
The Thirty-Fourth IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

The first one-stage framework for 3D whole-body mesh recovery and a large-scale upper-body dataset.

Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes [Homepage] [Paper Walkthrough (in Chinese)] [Code]
Xuan Ju, Ailing Zeng^#, Jianan Wang, Qiang Xu, Lei Zhang
The Thirty-Fourth IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Human-Art contains 50k high-quality images with over 123k person instances from 5 natural and 15 artificial scenarios, which are annotated with bounding boxes, keypoints, self-contact points, and text information for humans represented in both 2D and 3D.

Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation [Paper Walkthrough (in Chinese)] [Code]
Jie Yang, Ailing Zeng^#, Shilong Liu, Feng Li, Ruimao Zhang^#, Lei Zhang
Eleventh International Conference on Learning Representations (ICLR), 2023

We present a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation.

Are Transformers Effective for Time Series Forecasting? [Code] [bilibili]
Ailing Zeng, Muxi Chen, Lei Zhang, Qiang Xu
(Oral, Most Influential Paper in AAAI 2023) Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI), 2023

We question the validity of Transformer-based time series forecasting solutions using a simple linear layer.

DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation [Website] [Code] [Data]
Ailing Zeng, Xuan Ju, Lei Yang, Ruiyuan Gao, Xizhou Zhu, Bo Dai, Qiang Xu
European Conference on Computer Vision (ECCV), 2022

DeciWatch improves efficiency by more than 10× over existing works without any performance degradation for video-based 2D/3D human pose estimation.

SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos [Website] [Code] [知乎]
Ailing Zeng, Lei Yang, Xuan Ju, Jiefeng Li, Jianyi Wang, Qiang Xu
European Conference on Computer Vision (ECCV), 2022

SmoothNet is a plug-and-play refinement network to improve temporal smoothness and per-frame precision of any existing pose estimators.

HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling [Website] [Data]
Zhongang Cai*, Daxuan Ren*, Ailing Zeng*, Zhengyu Lin*, Tao Yu*, Wenjia Wang*, Xiangyu Fan, Yang Gao, Yifan Yu, Liang Pan, Fangzhou Hong, Mingyuan Zhang, Chen Change Loy, Lei Yang^, Ziwei Liu^
(Oral) European Conference on Computer Vision (ECCV), 2022

HuMMan is a large-scale multi-modal 4D human dataset with 1000 human subjects, 400k sequences and 60M frames.

SRNet: Improving Generalization in 3D Human Pose Estimation with a Split-and-Recombine Approach [Code]
Ailing Zeng, Xiao Sun, Fuyang Huang, Minhao Liu, Qiang Xu, Stephen Lin
European Conference on Computer Vision (ECCV), 2020

We design a split-and-recombine approach to improve generalization performance in 3D human pose estimation, especially on rare and unseen poses.

Honors & Awards

KAUST AI Rising Star, 2025
Shenzhen Artificial Intelligence Natural Science Award, 2023
Shenzhen Pengcheng Special Talent Award, 2023
Full Postgraduate Studentship, CUHK (2017 - 2021)
Excellent League Member, Top 1% student of Xiamen University
National Scholarship (2015)