T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

CVPR 2023

Jianrong Zhang1,*   Yangsong Zhang2,*   Xiaodong Cun3   Shaoli Huang3   Yong Zhang3   Hongwei Zhao1   Hongtao Lu2   Xi Shen3

* Equal Contribution   1 Jilin University   2 Shanghai Jiao Tong University   3 Tencent AI Lab, Shenzhen
teaser.png

Paper · GitHub · Colab Demo · HuggingFace Space Demo · Extended Version Page

Abstract


In this work, we investigate a simple and must-know conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT outperforms competitive approaches, including recent diffusion-based methods. For example, on HumanML3D, currently the largest text-to-motion dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), while obtaining an FID of 0.116 that largely outperforms MotionDiffuse (FID 0.630). We additionally analyze HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE remains a competitive approach for human motion generation.
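The two training recipes mentioned above (EMA codebook updates and Code Reset) can be sketched in a few lines of PyTorch. The snippet below is only an illustrative sketch, not our released code; the class name EMACodebook and the values of decay and reset_threshold are assumptions made for the example.

    import torch
    import torch.nn as nn

    class EMACodebook(nn.Module):
        """Illustrative EMA-updated codebook with code reset (not the released T2M-GPT code)."""

        def __init__(self, num_codes=512, dim=512, decay=0.99, reset_threshold=1.0):
            super().__init__()
            self.decay = decay
            self.reset_threshold = reset_threshold
            self.register_buffer("codebook", torch.randn(num_codes, dim))
            self.register_buffer("cluster_size", torch.zeros(num_codes))
            self.register_buffer("embed_sum", self.codebook.clone())

        @torch.no_grad()
        def update(self, z_e, codes):
            """z_e: (N, dim) encoder outputs; codes: (N,) assigned code indices."""
            one_hot = torch.zeros(z_e.size(0), self.codebook.size(0), device=z_e.device)
            one_hot.scatter_(1, codes.unsqueeze(1), 1.0)

            # Exponential moving average of code usage and of summed encoder features.
            self.cluster_size.mul_(self.decay).add_(one_hot.sum(0), alpha=1 - self.decay)
            self.embed_sum.mul_(self.decay).add_(one_hot.t() @ z_e, alpha=1 - self.decay)
            self.codebook.copy_(self.embed_sum / self.cluster_size.clamp(min=1e-5).unsqueeze(1))

            # Code reset: re-initialize rarely used ("dead") codes with random encoder outputs.
            dead = self.cluster_size < self.reset_threshold
            if dead.any():
                rand = z_e[torch.randint(0, z_e.size(0), (int(dead.sum()),), device=z_e.device)]
                self.codebook[dead] = rand
                self.embed_sum[dead] = rand
                self.cluster_size[dead] = self.reset_threshold

        def quantize(self, z_e):
            dist = torch.cdist(z_e, self.codebook)   # (N, num_codes) distances
            codes = dist.argmin(dim=1)               # nearest-code assignment
            z_q = self.codebook[codes]
            # Straight-through estimator so gradients flow back to the encoder.
            z_q = z_e + (z_q - z_e).detach()
            return z_q, codes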

Approach


T2M-VQ

T2M-GPT

VQ.png
Transformer.png
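In the second stage (T2M-GPT), the transformer predicts the next motion code conditioned on the text, and the corruption strategy mentioned in the abstract randomly replaces part of the ground-truth input codes during training so that the inputs better resemble what the model sees at inference time. The sketch below only illustrates the idea; the function name corrupt_codes and the corruption probability are assumptions, not the exact setting used in the paper.

    import torch

    def corrupt_codes(codes, num_codes, corrupt_prob=0.5):
        """codes: (B, T) ground-truth code indices fed as transformer inputs.
        Each position is replaced by a random code with probability corrupt_prob;
        the prediction targets are left untouched."""
        mask = torch.rand(codes.shape, device=codes.device) < corrupt_prob
        random_codes = torch.randint_like(codes, num_codes)
        return torch.where(mask, random_codes, codes)

    # Illustrative use in a training step (gpt(...) is a hypothetical forward signature):
    # inputs = corrupt_codes(gt_codes[:, :-1], num_codes=512)
    # logits = gpt(text_embedding, inputs)
    # loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), gt_codes[:, 1:])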

Comparison to state-of-the-art


comparison.png

Visual Results (More visual results can be found here)


[1] Guo et al. Generating Diverse and Natural 3D Human Motions from Text. CVPR, 2022.

[2] Tevet et al. Human Motion Diffusion Model. arXiv, 2022.

[3] Zhang et al. MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model. arXiv, 2022.

Text Prompting: a man steps forward and does a handstand.

T2M [1]

MDM [2]

MotionDiffuse [3]

002103_pred.gif
002103_pred.gif
002103_pred.gif

Ground-Truth

Ours

002103_gt.gif
002103_pred.gif

Text Prompting: A man rises from the ground, walks in a circle and sits back down on the ground.

T2M [1]

MDM [2]

MotionDiffuse [3]

000066_pred.gif
000066_pred.gif
000066_pred.gif

Ground-Truth

Ours

000066_gt.gif
000066_pred.gif

Text Prompting: a person jogs in place, slowly at first, then increases speed. they then back up and squat down.

T2M [1]

MDM [2]

MotionDiffuse [3]

004455_pred.gif
004455_pred.gif
004455_pred.gif

Ground-Truth

Ours

004455_gt.gif
004455_pred.gif

Text Prompting: a man starts off in an up right position with botg arms extended out by his sides, he then brings his arms down to his body and claps his hands together. after this he wals down amd the the left where he proceeds to sit on a seat

T2M [1]

MDM [2]

MotionDiffuse [3]

004742_pred.gif
004742_pred.gif
004742_pred.gif

Ground-Truth

Ours

004742_gt.gif
004742_pred.gif

Failure cases


Text Prompting: a person slightly crouches down and walks forward then back, then around slowly.

Problem: the generated motion misses some details of the textual description.

Ground-Truth

Ours

000026_gt.gif
000026_pred.gif

Resources


Paper

paper.png

Code

github_repo.png

Colab Demo

demo.png

Space Demo

space_demo.png

BibTeX

If you find this work useful for your research, please cite:
          @inproceedings{zhang2023generating,
            title={T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations},
            author={Zhang, Jianrong and Zhang, Yangsong and Cun, Xiaodong and Huang, Shaoli and Zhang, Yong and Zhao, Hongwei and Lu, Hongtao and Shen, Xi},
            booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
            year={2023},
          }

Acknowledgements


We thank Mathis Petrovich, Yuming Du, Yingyi Chen, Dexiong Chen and Xuelin Chen for inspiring discussions and valuable feedback.

This webpage was in part inspired by this template.