Duomin Wang (王多民)'s Homepage

Duomin Wang (王多民)

Email: wangduomin[at]gmail.com Google Scholar GitHub

I have been a senior researcher at Stepfun since 2024. My research interests include video generation, avatar synthesis and driving, human-centric world models, and representation learning.

Before joining Stepfun, I was a researcher at Xiaobing for nearly three years, working closely with Yu Deng and Baoyuan Wang. Before that, I worked at OPPO Research Institute for three years, where my research results were applied to the camera software of OPPO mobile phones as the basic face algorithm.

I'm seeking research interns to work on human-centric video and world models. Feel free to send me an email if you are interested.

Let's explore together how to leverage world models to create more possibilities for digital human interactions.

News

2025/07/24 We have open-sourced the SpeakerVid-5M dataset and its data curation pipeline. Check it out here.

2025/02/27 Had one paper accepted by CVPR 2025 about video prediction (MAGI).

2024/08/08 Had one paper accepted by the ECCV 2024 workshop EEC about agent avatars (AgentAvatar).

2024/07/01 Had one paper accepted by ECCV 2024 about 4D avatar synthesis (Portrait4D-v2).

2024/02/27 Had two papers accepted by CVPR 2024: one about 4D avatar synthesis (Portrait4D), the other about unconstrained virtual try-on (PICTURE).

2023/07/14 Had one paper accepted by ICCV 2023 about talking head synthesis (TH-PAD).

2023/07/10 The code and model for our CVPR 2023 work PD-FGC have been released. Check it out!

2023/02/28 Had one paper accepted by CVPR 2023 about talking head synthesis (PD-FGC).

Publications

	SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li arXiv 2507.09862, [PDF] [Project] [Code] [Dataset] We introduce SpeakerVid-5M, the first large-scale dataset designed specifically for the audio-visual dyadic interactive virtual human task.
	Taming Teacher Forcing for Masked Autoregressive Video Generation Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2025, [PDF] [Project] [Code] We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation.
	Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer Yu Deng, Duomin Wang, Baoyuan Wang 2024 European Conference on Computer Vision, ECCV 2024, [PDF] [Project] [Code] We learn a lifelike 4D head synthesizer by creating pseudo multi-view videos from monocular ones as supervision.
	PICTURE: PhotorealistIC Virtual Try-on from UnconstRained dEsigns Shuliang Ning, Duomin Wang, Yipeng Qin, Zirong Jin, Baoyuan Wang, Xiaoguang Han 2024 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2024, [PDF] [Project] [Code] [BibTeX] We propose a novel virtual try-on from unconstrained designs (ucVTON) task to enable photorealistic synthesis of personalized composite clothing on an input human image.
	Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, Baoyuan Wang 2024 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2024, [PDF] [Project] [Code] [BibTeX] We propose a one-shot 4D head synthesis approach for high-fidelity 4D head avatar reconstruction that is trained on large-scale synthetic data.
	Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang 2024 European Conference on Computer Vision, Workshop on EEC, ECCVW 2024, [PDF] [Project] [Code] [BibTeX] We introduce a system that harnesses LLMs to produce a series of detailed text descriptions of the avatar agents' facial motions, which are then processed by our task-agnostic driving engine into motion token sequences. These are subsequently converted into continuous motion embeddings that are further consumed by our standalone neural-based renderer to generate the final photorealistic avatar animations.
	Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, Baoyuan Wang 2023 IEEE International Conference on Computer Vision, ICCV 2023, [PDF] [Project] [Code(coming soon)] [BibTeX] We introduce a simple and novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for controlled synthesis in a deterministic manner, we instead probabilistically sample all the holistic lip-irrelevant facial motions (i.e. pose, expression, blink, gaze, etc.) to semantically match the input audio while still maintaining both the photo-realism of audio-lip synchronization and the overall naturalness.
	Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, Baoyuan Wang 2023 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2023, [PDF] [Project] [Code] [BibTeX] We present a novel one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze and blink, head pose, and emotional expression. We represent different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them.

Academic Service

Conference Reviewer: WACV (2025), NeurIPS (2025), ACMMM (2025), ICCV (2025), ICLR (2025), CVPR (2025). Journal Reviewer: IJCV.

The website template was adapted from Yu Deng.