Causal Motion Diffusion Models for Autoregressive Motion Generation
Abstract
Causal Motion Diffusion Models introduce a unified framework for autoregressive motion generation using a causal diffusion transformer in a semantically aligned latent space, enabling fast, high-quality text-to-motion synthesis with improved temporal smoothness.
Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or on autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent space, an autoregressive diffusion transformer is trained with causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, in which each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
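The abstract names two core mechanisms: causal diffusion forcing during training, where each latent frame receives an independent noise level under a causal attention mask, and a staggered frame-wise sampling schedule at inference, where each new frame is denoised conditioned on partially denoised predecessors. The sketch below illustrates the training side under standard DDPM-style assumptions; everything here (the `CausalDenoiser` module, `T_DIFFUSION`, the x0-parameterization, and all hyperparameters) is illustrative, not taken from the paper.

```python
# Minimal sketch of causal diffusion forcing, assuming a DDPM-style noising
# process and an x0-predicting causal transformer denoiser. All names and
# hyperparameters are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

T_DIFFUSION = 1000                                      # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T_DIFFUSION)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class CausalDenoiser(nn.Module):
    """Transformer that denoises latent frames under a causal attention mask."""
    def __init__(self, dim=256, heads=4, layers=6):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.t_embed = nn.Embedding(T_DIFFUSION, dim)   # per-frame noise-level embedding
        self.out = nn.Linear(dim, dim)

    def forward(self, z_noisy, t_per_frame, text_emb):
        # Condition each frame on its own noise level and on the text embedding.
        h = z_noisy + self.t_embed(t_per_frame) + text_emb.unsqueeze(1)
        mask = nn.Transformer.generate_square_subsequent_mask(z_noisy.size(1))
        h = self.backbone(h, mask=mask.to(z_noisy.device))
        return self.out(h)                              # predicted clean latents (x0)

def diffusion_forcing_step(model, z0, text_emb):
    """One training step: every frame draws an independent noise level, so the
    model learns to denoise frame i given arbitrarily noisy history frames."""
    B, L, _ = z0.shape
    t = torch.randint(0, T_DIFFUSION, (B, L))           # independent per-frame timestep
    a = alphas_cumprod[t].unsqueeze(-1)                 # (B, L, 1)
    noise = torch.randn_like(z0)
    z_noisy = a.sqrt() * z0 + (1.0 - a).sqrt() * noise
    return F.mse_loss(model(z_noisy, t, text_emb), z0)
```

At inference, a model trained this way can be driven with a staggered schedule in which frame i lags frame i-1 by a fixed offset on the noise ladder, so each frame is always predicted from cleaner but not yet final history. One possible realization (the `stride` offset and the DDIM-style update are again assumptions, not the paper's exact schedule):

```python
@torch.no_grad()
def staggered_sample(model, text_emb, num_frames, dim=256, stride=50):
    """Frame i trails frame i-1 by `stride` diffusion steps, so it is always
    denoised conditioned on partially denoised (but cleaner) predecessors."""
    z = torch.randn(1, num_frames, dim)                 # every frame starts as pure noise
    frame_idx = torch.arange(num_frames)
    for s in range(T_DIFFUSION + stride * (num_frames - 1)):
        t_raw = T_DIFFUSION - 1 - s + stride * frame_idx    # staggered noise ladder
        t = t_raw.clamp(0, T_DIFFUSION - 1)
        pred_x0 = model(z, t.unsqueeze(0), text_emb)
        a_t = alphas_cumprod[t].view(1, -1, 1)
        eps = (z - a_t.sqrt() * pred_x0) / (1.0 - a_t).sqrt()
        a_prev = alphas_cumprod[(t - 1).clamp(min=0)].view(1, -1, 1)
        z_step = a_prev.sqrt() * pred_x0 + (1.0 - a_prev).sqrt() * eps  # DDIM-style update
        # Frames not yet inside the denoising window stay pure noise; a frame
        # that reaches step 0 commits its clean prediction and is then frozen.
        commit = (t_raw == 0).view(1, -1, 1)
        active = ((t_raw > 0) & (t_raw < T_DIFFUSION)).view(1, -1, 1)
        z = torch.where(commit, pred_x0, torch.where(active, z_step, z))
    return z
```

Under this kind of schedule each additional frame costs only `stride` extra iterations rather than a full denoising pass, which illustrates how staggered, causally ordered sampling can amortize denoising across frames for streaming and long-horizon generation.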
Community
Proposes Causal Motion Diffusion Models (CMDM) using a causal diffusion transformer and MAC-VAE latent space for autoregressive, streaming, long-horizon text-to-motion with improved fidelity and reduced latency.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction (2026)
- Causality in Video Diffusers is Separable from Denoising (2026)
- Causality-Aware Temporal Projection for Video Understanding in Video-LLMs (2026)
- Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion (2026)
- TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation (2026)
- PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance (2026)
- Pathwise Test-Time Correction for Autoregressive Long Video Generation (2026)