FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

Yihao Wu¹,³, He Zhang¹,²,*, Junbo Tan³, Xueqian Wang³, Zhengyou Zhang¹,²
¹Tencent Robotics X, ²Futian Laboratory, ³Tsinghua University
*Corresponding author

Overview of the FlowPRO framework. (a) SFT Base Model: Stage 1 trains π_θ on D_SFT. (b) Data Collection: operator-triggered rollback and operator teleoperation yield paired positive and negative trajectories (τ_w, τ_l). (c) Preference Dataset: pairs are aggregated into D_pref^k. (d) Smooth Interpolation: Bézier interpolation synthesizes the missing counterpart at action-chunk granularity (e.g., MJ, JN′), producing dense per-state tuples. (e) Algorithm (RPRO): a_θ and a_ref from π_θ and the frozen π_ref are compared against a_w / a_l to drive r_w↑ and r_l↓, optimized on a mixed batch of D_pref^k, D_pref^{<k}, and D_SFT; the updated policy is redeployed for K rounds.

Abstract

Post-training Vision-Language-Action (VLA) models into policies that can be reliably deployed on real robots remains a major bottleneck. Supervised fine-tuning (SFT) and DAgger exploit failure signals only indirectly, and reward-based RL is limited by the difficulty of designing real-world rewards and of training reliable critics. We present FlowPRO, a reward-free offline reinforced fine-tuning framework for flow-matching VLAs. Algorithmically, we propose RPRO (Robotic Flow-matching Proximalized Preference Optimization), a preference-optimization objective tailored to the flow-matching action head of VLA models. RPRO pairs a contrastive optimizer with an explicit proximal regularizer that anchors the absolute magnitude of the implicit reward, thereby eliminating the reward-hacking failure mode of plain Flow-DPO. On the data side, a teleoperated intervention-and-rollback paradigm produces naturally paired positive and negative trajectories (τ_w, τ_l) on a real robot from a single operator action; a Smooth Interpolation procedure, combined with batch mixing, then converts these sparse corrections into dense per-state supervision while preserving the base policy's capabilities. On four long-horizon bimanual tasks, FlowPRO attains the highest success rate, outperforming four representative baselines, and ablations confirm the contribution of each loss component.

Method Overview

FlowPRO is a two-stage framework. Stage 1 performs supervised fine-tuning on a task-specific dataset to obtain a base policy from a flow-matching VLA backbone. Stage 2 then runs an iterative offline-RL loop on top of it for K rounds. In each round, three components work together:

  • Intervention-and-Rollback Data Collection. A human operator teleoperates corrections only when the policy is about to fail, then rolls the robot back along the recorded trajectory and lets the policy retry. A single operator action produces a naturally paired positive and negative trajectory (τ_w, τ_l), while the rollback horizon Δ diversifies the per-pair initial state.
  • Smooth Interpolation. Sparse trajectory-level corrections are converted into dense per-state preference tuples by synthesizing the missing counterpart at each state via a smooth Bézier interpolation, preserving the base policy's local distribution.
  • RPRO Loss. The full RPRO objective combines a flow-matching adaptation of PRO with a supervised regression term: L_RPRO = λ_PRO · L_PRO + λ_SFT · L_SFT. The proximal regularizer in L_PRO anchors the absolute magnitude of the implicit reward at r ≈ 0, eliminating the reward-hacking failure mode of plain Flow-DPO. A gradient-vanishing property further makes preference pairs and SFT demonstrations compatible under a single loss.
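
To make the Smooth Interpolation step concrete, here is a minimal sketch of sampling one action chunk along a cubic Bézier curve. How FlowPRO chooses the endpoints and control points is not specified above, so the function name `cubic_bezier_chunk`, the endpoint tangents `v0` and `v3`, and the `horizon` parameter are all assumptions made for illustration.

```python
import numpy as np


def cubic_bezier_chunk(p0, p3, v0, v3, horizon):
    """Sample `horizon` actions on a cubic Bezier curve from p0 to p3.

    ASSUMPTION: v0 and v3 are desired tangents at the two endpoints. The
    inner control points sit one third of the way along those tangents
    (the standard Hermite-to-Bezier conversion), so the curve leaves p0
    with derivative v0 and arrives at p3 with derivative v3.
    """
    p0, p3, v0, v3 = (np.asarray(x, dtype=float) for x in (p0, p3, v0, v3))
    p1 = p0 + v0 / 3.0
    p2 = p3 - v3 / 3.0
    # Curve parameter in [0, 1], shaped (horizon, 1) to broadcast over action dims.
    t = np.linspace(0.0, 1.0, horizon)[:, None]
    return (
        (1 - t) ** 3 * p0
        + 3 * (1 - t) ** 2 * t * p1
        + 3 * (1 - t) * t ** 2 * p2
        + t ** 3 * p3
    )


# Hypothetical 2-D action space: move from (0, 0) to (1, 1), at rest at both ends.
chunk = cubic_bezier_chunk([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], horizon=5)
print(chunk.shape)
```

With zero end tangents the chunk starts and ends at rest, which is one plausible way to splice a synthesized segment between two recorded states without a velocity jump.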
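
The weighted sum in the RPRO objective can be sketched numerically as follows. The exact form of L_PRO is not given in this overview, so the sketch rests on two loud assumptions: the implicit reward is the gap between the reference and policy flow-matching errors (the construction used by DPO-style objectives for diffusion models), and the proximal regularizer is a plain quadratic penalty that pulls r toward 0. The names `beta`, `alpha`, `implicit_reward`, and `rpro_sketch` are hypothetical and do not come from the paper.

```python
import numpy as np


def fm_error(v_pred, v_target):
    """Squared flow-matching error per sample, summed over action dimensions."""
    return np.sum((v_pred - v_target) ** 2, axis=-1)


def implicit_reward(v_theta, v_ref, v_target, beta):
    """ASSUMED implicit reward: positive when the policy's velocity
    prediction fits the target better than the frozen reference does."""
    return -beta * (fm_error(v_theta, v_target) - fm_error(v_ref, v_target))


def rpro_sketch(r_w, r_l, sft_err, lam_pro=1.0, lam_sft=1.0, alpha=0.1):
    """Stand-in for L_RPRO = lam_pro * L_PRO + lam_sft * L_SFT."""
    # Contrastive term: -log sigmoid(r_w - r_l), written in a stable form.
    contrastive = np.mean(np.logaddexp(0.0, -(r_w - r_l)))
    # ASSUMED proximal term: quadratic anchor on the reward magnitude.
    proximal = alpha * np.mean(r_w ** 2 + r_l ** 2)
    l_pro = contrastive + proximal
    # Supervised regression term: mean flow-matching error on demonstrations.
    l_sft = np.mean(sft_err)
    return lam_pro * l_pro + lam_sft * l_sft


# Toy batch with random velocities; the policy starts equal to the reference.
rng = np.random.default_rng(0)
target_w, target_l = rng.normal(size=(4, 7)), rng.normal(size=(4, 7))
v_ref_w, v_ref_l = rng.normal(size=(4, 7)), rng.normal(size=(4, 7))
r_w = implicit_reward(v_ref_w, v_ref_w, target_w, beta=1.0)
r_l = implicit_reward(v_ref_l, v_ref_l, target_l, beta=1.0)
loss = rpro_sketch(r_w, r_l, sft_err=fm_error(v_ref_w, target_w))
print(float(loss))
```

When the policy equals the reference, both implicit rewards are exactly zero, so in this sketch the preference part of the loss reduces to log 2 and only the supervised term varies with the data.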

Real-Robot Tasks

We evaluate FlowPRO on four long-horizon bimanual tasks on a Dobot XTrainer platform: Pack (cosmetic packaging), Cap (pen-cap assembly), USB (USB insertion), and Case (pencil-case packing). Each task is evaluated with 100 rollouts on top of two flow-matching VLA backbones (π0 and π0.5).


Real-Robot Demos

Demo videos will be released alongside the paper. Rollouts of the four tasks will appear here:

  • Pack (cosmetic packaging): demo coming soon
  • Cap (pen-cap assembly): demo coming soon
  • USB (USB insertion): demo coming soon
  • Case (pencil-case packing): demo coming soon

BibTeX

@inproceedings{wu2026flowpro,
  title     = {FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization},
  author    = {Wu, Yihao and Zhang, He and Tan, Junbo and Wang, Xueqian and Zhang, Zhengyou},
  
  year      = {2026},
  note      = {Under review}
}