Legged locomotion demands controllers that are both robust and adaptable while remaining compatible with task and safety considerations. Model-free reinforcement learning (RL) methods, however, often yield a fixed policy that is difficult to adapt to new behaviors at test time. In contrast, Model Predictive Control (MPC) provides a natural approach to flexible behavior synthesis by incorporating different objectives and constraints directly into its optimization process, yet classical MPC relies on accurate dynamics models, which are often difficult to obtain in complex environments and typically require simplifying assumptions. We present Diffusion-MPC, which leverages a learned generative diffusion model as an approximate dynamics prior for planning, enabling flexible test-time adaptation through reward- and constraint-based optimization. Diffusion-MPC jointly predicts future states and actions; at each reverse step, it incorporates reward planning and imposes constraint projection, yielding trajectories that satisfy task objectives while remaining within physical limits. To obtain a planning model that adapts beyond imitation pretraining, we introduce an interactive training algorithm for the diffusion-based planner: we execute our reward-and-constraint planner in the environment, then filter and reweight the collected trajectories by their realized returns before updating the denoiser. This design enables strong test-time adaptability, allowing the planner to adjust to new reward specifications without retraining. We validate Diffusion-MPC in the real world, demonstrating strong locomotion and flexible adaptation.
Rather than directly fitting an action-only policy, our diffusion model learns to jointly represent state transitions and action proposals from large, heterogeneous datasets. This learned generative prior then plays the role of the planner in an MPC framework: during each planning cycle, trajectories are sampled from the diffusion model and optimized with reward terms and constraints, effectively performing model-based planning without reliance on hand-crafted dynamics. In this view, diffusion models are not just conditional generators but expressive approximators of environment dynamics that make tractable, flexible MPC possible. Reward-based planning updates steer generated trajectories toward task objectives, while feasibility is maintained through constraint projection. Candidate ranking is then applied to further refine the selected plan. Together, these mechanisms provide adaptability while avoiding both the simplified model designs required by classical MPC and the rigidity of fixed RL policies.
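As a concrete illustration of one such planning cycle, the sketch below shows a possible realization; the interfaces `denoise_step`, `reward_fn`, and `project_constraints`, as well as the gradient-based guidance form and all hyperparameter values, are our assumptions rather than the exact implementation.

```python
import torch

def diffusion_mpc_plan(denoise_step, reward_fn, project_constraints, obs,
                       num_candidates=16, horizon=24, dim=48,
                       num_steps=10, guide_scale=0.1):
    """One planning cycle (illustrative sketch, not the exact implementation).

    Assumed interfaces: `denoise_step(tau, k, obs)` performs one reverse
    diffusion step on a batch of joint state-action trajectories,
    `reward_fn(tau)` returns a differentiable per-candidate reward, and
    `project_constraints(tau)` maps trajectories back into the feasible set.
    """
    # Start from Gaussian noise over joint state-action trajectories.
    tau = torch.randn(num_candidates, horizon, dim)

    for k in reversed(range(num_steps)):
        # Reverse diffusion step conditioned on the latest observation.
        tau = denoise_step(tau, k, obs)

        # Reward-based planning update: nudge candidates toward the objective.
        tau = tau.detach().requires_grad_(True)
        grad = torch.autograd.grad(reward_fn(tau).sum(), tau)[0]
        tau = (tau + guide_scale * grad).detach()

        # Constraint projection keeps candidates within physical limits.
        tau = project_constraints(tau)

    # Candidate ranking: return the plan with the highest predicted reward.
    return tau[reward_fn(tau).argmax()]
```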
Beyond planning with a single reward function, our framework also supports flexible skill composition through reward combination. Let $\{R_i(\tau)\}_{i=1}^K$ denote a set of scalar task rewards, which may be either neural or analytic. At deployment, the user specifies weights $\alpha \in \mathbb{R}^K$ to form a composite objective
\[ R_{\alpha}(\tau) = \sum_{i=1}^{K} \alpha_i\,R_i(\tau). \]
By varying the weights $\alpha_i$, Diffusion-MPC can seamlessly trade off between different objectives, synthesizing a diverse range of behaviors. This includes not only behaviors represented in the dataset but also novel behaviors arising from new combinations of reward signals.
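A minimal sketch of how such a composite objective can be assembled and handed to the planner; the particular reward functions named in the usage example are hypothetical.

```python
def composite_reward(reward_fns, alphas):
    """Build R_alpha(tau) = sum_i alpha_i * R_i(tau) from per-task rewards.

    `reward_fns` may mix neural reward models and analytic terms; the entries
    in the usage example below are hypothetical, not the paper's reward set.
    """
    def R_alpha(tau):
        return sum(a * R(tau) for a, R in zip(alphas, reward_fns))
    return R_alpha

# Example: trade off velocity tracking against energy use at test time,
# without retraining the diffusion prior.
# reward_fn = composite_reward([velocity_reward, energy_reward], alphas=[1.0, 0.3])
```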
We propose a strategy to collect data and finetune the diffusion prior using trajectories generated by our model in an online, interactive manner. Let $(\tau^{(k)}, k, \epsilon)$ denote a standard denoising tuple constructed from a clean trajectory $\tau$ with forward noise $\epsilon \sim \mathcal{N}(0,I)$. Per-trajectory weights are defined from realized returns as
\[ w\!\left(R_r(\tau)\right)=\exp\!\left(\frac{R_r(\tau)}{T}\right), \]
with $T>0$ as a temperature parameter. Here $R_r$ denotes the ground-truth return from the environment, analogous to the RL setting, rather than the reward model used during planning. To filter out low-return rollouts, we retain only the top-$K$ weights and set the rest to zero:
\[ w'(\tau) = \begin{cases} w(R_r(\tau)), & \tau \in \text{Top-}K, \\ 0, & \text{otherwise.} \end{cases} \]
\[ \bar w(\tau) = \frac{w'(\tau)}{\mathbb{E}[w'(\tau)]}. \]
The resulting objective is
\[ \mathcal{L}_{\text{RWD}}(\theta)=\mathbb{E}\Big[\,\bar w\!\left(R_r(\tau)\right)\,\big\|\,\tau-\tau_{\theta}(\tau^{(k)},k)\big\|_2^2\Big], \]
which performs exponentially tilted regression, biasing updates toward higher-return trajectories. A replay buffer interleaves on-policy planner rollouts with previously collected trajectories, preserving coverage of prior experiences while nudging the model toward reward-favored regions of the trajectory space.
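A condensed sketch of this reward-weighted update, assuming a PyTorch-style denoiser that exposes `add_noise` and `num_steps` helpers; the tensor shapes, buffer handling, and hyperparameter values are illustrative assumptions.

```python
import torch

def reward_weighted_loss(denoiser, trajs, returns, temperature=1.0, top_k=32):
    """Exponentially tilted denoising regression over a batch of rollouts.

    `trajs` holds clean trajectories (batch, horizon, dim) drawn from the
    replay buffer; `returns` holds their realized environment returns R_r.
    """
    # Exponential tilting: w = exp(R_r / T).
    w = torch.exp(returns / temperature)

    # Keep only the top-K weights and zero out the rest.
    keep = torch.zeros_like(w)
    keep[torch.topk(w, k=min(top_k, w.numel())).indices] = 1.0
    w = w * keep

    # Normalize so the retained weights have unit mean over the batch.
    w_bar = w / w.mean().clamp_min(1e-8)

    # Standard forward-noising tuple (tau_k, k, eps).
    k = torch.randint(0, denoiser.num_steps, (trajs.shape[0],))
    eps = torch.randn_like(trajs)
    tau_k = denoiser.add_noise(trajs, eps, k)      # assumed helper

    # Weighted reconstruction loss toward the clean trajectory.
    pred = denoiser(tau_k, k)
    per_sample = ((trajs - pred) ** 2).mean(dim=(1, 2))
    return (w_bar * per_sample).mean()
```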
Asynchronous planning for real-time control. To meet high-rate locomotion requirements, we employ an asynchronous pipeline with planning horizon \(H\) and replan margin \(D\). At each timestep \(t\), the controller executes the next action from the current \(H\)-step plan \(a_{t:t+H-1}\). When the execution index reaches \(H{-}D\), we trigger replanning from the latest observation to synthesize a fresh \(H\)-step plan while continuing to execute the remaining \(D\) buffered actions from the old plan. Once these \(D\) actions have been applied, we time-align the new plan by skipping its first \(D\) actions and begin execution at offset \(D\). Equivalently, each action is computed \(D\) control cycles before it is applied (a \(D\)-step action buffer), which maintains real-time operation while preserving closed-loop feedback with period \(H{-}D\) steps.
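The buffering logic can be sketched as follows; the real system dispatches planning to a background worker, and the class, attribute, and method names here are illustrative assumptions.

```python
class AsyncPlanner:
    """D-step action buffering with replanning triggered at index H - D."""

    def __init__(self, plan_fn, horizon, replan_margin):
        self.plan_fn = plan_fn              # e.g. diffusion_mpc_plan(...)
        self.H = horizon
        self.D = replan_margin
        self.plan = None                    # current H-step action plan
        self.idx = 0                        # execution index into the plan
        self.pending = None                 # plan being synthesized for the next window

    def act(self, obs):
        if self.plan is None:
            self.plan = self.plan_fn(obs)   # initial plan (blocking)

        # Trigger replanning D steps before the current plan runs out; in the
        # real system this call runs asynchronously in a background thread.
        if self.idx == self.H - self.D and self.pending is None:
            self.pending = self.plan_fn(obs)

        action = self.plan[self.idx]
        self.idx += 1

        # Once the D buffered actions are consumed, switch to the new plan and
        # skip its first D actions so it is time-aligned with execution.
        if self.idx == self.H and self.pending is not None:
            self.plan, self.pending = self.pending, None
            self.idx = self.D

        return action
```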
Caching early denoising. Successive plans generated by our model often produce nearly identical trajectories in early diffusion steps, as these steps primarily denoise without incorporating task-specific structure. To avoid redundant computation, we shift the existing plan across time steps and reuse it as the initialization for the next window, up to $m$ steps. This warm-start strategy preserves solution quality while substantially reducing inference cost.
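One possible realization of this warm start is sketched below, assuming plans are stored as plain tensors; the end-of-horizon padding scheme and the re-noising level used in place of the skipped early steps are our assumptions.

```python
import torch

def warm_start(prev_plan, shift, noise_level=0.3):
    """Reuse the previous plan, shifted in time, as the next initialization.

    `prev_plan` has shape (horizon, dim). The trailing `shift` steps are padded
    by repeating the last step; a small amount of noise stands in for the early
    denoising steps (up to m) that the warm start allows us to skip.
    """
    shifted = torch.cat([prev_plan[shift:],
                         prev_plan[-1:].repeat(shift, 1)], dim=0)
    return shifted + noise_level * torch.randn_like(shifted)
```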
Sampler choice and step budget. DDIM offers faster, deterministic sampling at some cost in fidelity, while DDPM is slower but higher quality. The number of denoising steps controls the compute–quality trade-off. We use $10$ DDPM steps at inference and ablate both the training horizon and test-time step count.
For adaptation tasks, we consider locomotion with the following objectives: base height variation, joint limit restriction, energy saving, joint acceleration/velocity regularization, and balancing. Metrics report penalties for the different task components, with each penalty type scaled independently for clarity; smaller penalties indicate closer adherence to the desired behavior. Results are reported as a function of the number of candidates (Cand), reward-based planning (R), and constraint enforcement (C).
Adaptation capability is evaluated across four representative tasks: energy saving, joint position regulation, height variation, and dynamic balancing.
Diffusion-MPC is evaluated on challenging real-world terrains, including soft uneven grass with varying friction and a grass slope with varying inclination. The planner is deployed in a zero-shot manner without environment-specific retraining. A neural-network-based foot-lifting reward model encourages stable stepping on uneven surfaces, while a balancing reward enhances stability during traversal. For slope locomotion, regularization on the rear-calf joint position is applied adaptively: larger angles are favored for ascending slopes to prevent backward slipping, whereas smaller angles are encouraged for descending slopes to maintain forward stability. These results highlight that diffusion-based planning enables deployment in the wild, providing both adaptability to diverse terrains and flexible behavior modulation at test time.
Our interactive learning algorithm enables a weak planner to be finetuned into a robust controller; moreover, even a planner trained entirely from scratch can acquire effective behaviors. This highlights that diffusion-based planning can be learned without dependence on demonstration data, broadening its applicability in real-world settings.
@article{huang2025flexible,
title={Flexible Locomotion Learning with Diffusion Model Predictive Control},
author={Huang, Runhan and Balim, Haldun and Yang, Heng and Du, Yilun},
journal={arXiv preprint arXiv:2510.04234},
year={2025}
}