In this work, we investigate a simple and well-known conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and the Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) is sufficient to obtain high-quality discrete representations. For the GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT achieves better performance than competitive approaches, including recent diffusion-based methods [2]. For example, on HumanML3D [1], currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116, largely outperforming MotionDiffuse at 0.630. We additionally analyze HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE still remains a competitive approach for human motion generation.
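As a rough illustration of the corruption strategy mentioned above, the sketch below randomly replaces a fraction of the ground-truth motion token indices with random codebook entries before they are fed to the GPT during training, so the model also sees imperfect contexts as it will at inference time. The function name, the corruption probability `tau`, and the codebook size are illustrative assumptions, not the exact implementation.

```python
import torch

def corrupt_motion_tokens(tokens: torch.Tensor,
                          codebook_size: int,
                          tau: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of a training-time corruption strategy.

    tokens:        (batch, seq_len) integer code indices from the VQ-VAE encoder
    codebook_size: number of entries in the VQ-VAE codebook (assumed value)
    tau:           probability that a given token is corrupted (assumed name/value)
    """
    # Decide independently for each position whether to corrupt it.
    mask = torch.rand_like(tokens, dtype=torch.float) < tau
    # Draw random replacement indices from the codebook.
    random_tokens = torch.randint_like(tokens, codebook_size)
    # Keep the original token where the mask is False, replace it otherwise.
    return torch.where(mask, random_tokens, tokens)

# Example usage (shapes and values are illustrative):
# corrupted = corrupt_motion_tokens(gt_tokens, codebook_size=512, tau=0.5)
```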
[1] Guo et al. Generating diverse and natural 3D human motions from text. CVPR, 2022.
[2] Tevet et al. Human motion diffusion model. arXiv, 2022.
@inproceedings{zhang2023generating,
  title={T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations},
  author={Zhang, Jianrong and Zhang, Yangsong and Cun, Xiaodong and Huang, Shaoli and Zhang, Yong and Zhao, Hongwei and Lu, Hongtao and Shen, Xi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023},
}
We thank Mathis Petrovich, Yuming Du, Yingyi Chen, Dexiong Chen and Xuelin Chen for inspiring discussions and valuable feedback.
© This webpage was in part inspired by this template.