VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

Jun Zheng1    Fuwei Zhao2    Youjiang Xu2    Xin Dong2    Xiaodan Liang1   
1Shenzhen Campus of Sun Yat-Sen University  
2ByteDance  

Abstract

Video try-on is a promising area with tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, and underperform on casually captured videos. Recently, Sora revealed the scalability of Diffusion Transformers (DiT) in generating lifelike videos of real-world scenarios. Inspired by this, we propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a Spatio-Temporal denoising DiT, and an identity preservation ControlNet. To faithfully recover clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation. Existing attempts require the laborious and restrictive construction of paired training datasets, which severely limits their scalability; VITON-DiT alleviates this by relying solely on unpaired human dance videos and a carefully designed multi-stage training strategy. Furthermore, we curate a challenging benchmark dataset to evaluate the performance of casual video try-on. Extensive experiments demonstrate the superiority of VITON-DiT in generating spatio-temporally consistent try-on results for in-the-wild videos with complicated human poses.

Adapting to Different Clothing Styles and Scenes

VITON-DiT can handle complex clothing and backgrounds even with less accurate pose conditions.

Simple Clothing and Static Backgrounds

Challenging Clothing and In-the-Wild Scenes

Adapting to Different Perspectives and Human Poses

Unlike previous video try-on methods, which are limited to simple human poses and slow movement, our VITON-DiT can perform try-on for unusual viewpoints and complicated poses.

Simple Poses and Slow Motion

Rare Perspectives and Complex Body Movements

Framework

Overview of the proposed VITON-DiT.

(a) The architecture contains three components:
(1) Denoising DiT: generates the latent representation of the video content via a chain of Spatio-Temporal (ST-)DiT blocks.
(2) ID ControlNet: produces feature residuals for the Denoising DiT to preserve the reference person's identity, pose, and background.
(3) Garment Extractor: extracts garment features and delivers them to the Denoising DiT and the ID ControlNet via Attention Fusion, thus recovering detailed clothing textures in the generated try-on video.
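As a rough illustration of how the three components interact during one denoising step, here is a minimal PyTorch-style sketch. All names, call signatures, and exact injection points are assumptions made for exposition, not the authors' actual API.

import torch
import torch.nn as nn

def denoising_step(
    denoising_dit: nn.Module,
    id_controlnet: nn.Module,
    garment_extractor: nn.Module,
    noisy_latents: torch.Tensor,   # noised video latents at timestep t
    id_conditions: torch.Tensor,   # encoded person identity / pose / background
    garment_image: torch.Tensor,   # clothing image to try on
    t: torch.Tensor,               # diffusion timestep
) -> torch.Tensor:
    # (3) Garment Extractor: garment features shared by both branches below.
    garment_feats = garment_extractor(garment_image)
    # (2) ID ControlNet: feature residuals conditioned on the reference
    #     person, fused with the same garment features via Attention Fusion.
    residuals = id_controlnet(noisy_latents, id_conditions, garment_feats, t)
    # (1) Denoising DiT: ST-DiT blocks predict the noise; the ControlNet
    #     residuals are added to intermediate features, and the garment
    #     features are fused into each block's self-attention output.
    return denoising_dit(noisy_latents, garment_feats, residuals, t)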

(b) Illustration of Attention Fusion: integrates the person denoising features and the extracted garment features using additive attention. This operation is used in both the Denoising DiT and the ID ControlNet.
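Below is a minimal PyTorch sketch of one way to realize this additive attention fusion; the module name, feature dimension, and head count are illustrative assumptions rather than the paper's exact implementation. Summing the two attention outputs lets the garment pathway act as a residual on top of the standard self-attention.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Hypothetical sketch of the additive attention fusion described above.

    The fused output is the sum of (1) self-attention over the person
    denoising features and (2) cross-attention from those features to the
    extracted garment features. Dimensions and head counts are illustrative.
    """

    def __init__(self, dim: int = 1152, num_heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, person_feats: torch.Tensor,
                garment_feats: torch.Tensor) -> torch.Tensor:
        # Self-attention over the person (denoising) tokens.
        self_out, _ = self.self_attn(person_feats, person_feats, person_feats)
        # Cross-attention: person tokens query the garment tokens.
        cross_out, _ = self.cross_attn(person_feats, garment_feats, garment_feats)
        # Additive fusion of the two attention outputs.
        return self_out + cross_out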

BibTeX

@article{zheng2024vitondit,
  title={VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers},
  author={Zheng, Jun and Zhao, Fuwei and Xu, Youjiang and Dong, Xin and Liang, Xiaodan},
  journal={arXiv preprint},
  year={2024}
}