ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

1 School of Information, Renmin University of China   2 Zhipu AI   3 Department of Computer Science and Technology, Tsinghua University
*Work done during an internship at Zhipu AI. Corresponding author: zhang-jing@ruc.edu.cn

Abstract

Long-horizon robotic manipulation requires dense feedback that reflects how a task advances through its procedural stages, not merely whether the final outcome is successful. Existing reward models often rely on trajectory-level success labels or time-based interpolation, which can conflate elapsed time with true task progress and fail to capture unfinished steps, stagnation, and failure states.

We present ProcVLM, a progress-aware vision-language model that learns procedure-grounded progress as a dense reward signal for manipulation. Rather than deriving progress from terminal outcomes or temporal proxies, ProcVLM grounds progress estimation in procedural structure and intra-stage visual change. It further adopts a reasoning-before-estimation paradigm, first inferring the remaining atomic actions before estimating task progress.

To construct supervision, we synthesize frame-level subtask-semantic annotations, assign progress budgets according to subtask structure, and distribute each budget based on intra-subtask visual change. This pipeline produces ProcCorpus-60M from 30 embodied datasets with 60M annotated frames, from which we derive ProcVQA for procedure-aware pretraining. Experiments show that ProcVLM improves embodied procedural reasoning and yields more discriminative trajectory-internal progress estimates than representative baselines.

Method

ProcVLM is built on a scalable pipeline for synthesizing procedural supervision. A large VLM annotator converts raw robot trajectories into frame-wise labels of subtask stages, completion states, and remaining actions. These annotations form ProcCorpus-60M, a corpus of about 400K real-robot and simulated trajectories with over 60M annotated frames.
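To make the label format concrete, here is a minimal Python sketch of what a frame-level annotation record could look like; the field names and types are our illustration, not the released ProcCorpus-60M schema:

from dataclasses import dataclass, field

@dataclass
class FrameAnnotation:
    """Hypothetical frame-level record; fields are illustrative,
    not the released ProcCorpus-60M schema."""
    frame_index: int
    subtask: str                 # current subtask stage, e.g. "grasp bowl"
    subtask_state: str           # e.g. "in_progress", "completed", "failed"
    remaining_actions: list[str] = field(default_factory=list)  # atomic actions left
    progress: float = 0.0        # scalar task progress in [0, 1]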

From these annotations, we construct ProcVQA, a procedure-aware VQA training set with three task families: action segmentation, future planning, and task progress prediction. Progress targets are defined from subtask structure and intra-subtask visual change, so the supervision reflects procedural advancement rather than elapsed time.
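To illustrate the construction, the following is a minimal sketch of one way such progress targets could be computed, assuming equal per-subtask budgets and per-frame visual-change scores (e.g., distances between consecutive frame embeddings); the paper's actual budget rule and change measure may differ:

import numpy as np

def progress_targets(subtask_spans, visual_change):
    """Sketch of procedure-grounded progress targets (our reading of the
    method, not the released code).

    subtask_spans: list of (start, end) frame-index pairs, one per subtask.
    visual_change: per-frame nonnegative change scores, e.g. distances
                   between consecutive frame embeddings.
    """
    targets = np.zeros(len(visual_change))
    budget = 1.0 / len(subtask_spans)   # equal budget per subtask (one possible rule)
    base = 0.0
    for start, end in subtask_spans:
        cum = np.cumsum(visual_change[start:end])
        total = cum[-1] if cum[-1] > 0 else 1.0
        # distribute this subtask's budget by cumulative intra-subtask change
        targets[start:end] = base + budget * cum / total
        base += budget
    return targets

Because each subtask consumes a fixed share of the [0, 1] progress range, a frame's target rises only when the scene visibly changes within its subtask, so stagnation yields flat progress rather than the steady ramp a time-based proxy would produce.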

ProcVLM is trained to jointly perform textual procedure reasoning and continuous progress estimation. A shared VLM backbone generates reasoning-formatted responses, while a dedicated progress value head regresses scalar completion scores, enabling ProcVLM to serve as a dense reward model for reward-guided policy optimization.
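The sketch below shows one way a shared backbone and a scalar progress head could be wired together in PyTorch; the backbone interface, head design, and loss weighting are assumptions, not the paper's exact architecture:

import torch
import torch.nn as nn

class ProgressRewardModel(nn.Module):
    """Illustrative sketch: shared VLM backbone with a scalar progress head.
    Architectural details are assumptions, not the paper's exact design."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone            # any VLM exposing hidden states
        self.progress_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, inputs):
        out = self.backbone(**inputs, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1]    # final-token representation
        progress = torch.sigmoid(self.progress_head(last_hidden)).squeeze(-1)
        return out.logits, progress                   # text logits + scalar progress

# Joint objective (illustrative): language modeling on the reasoning text
# plus regression on progress, e.g.
#   loss = lm_loss + lambda_reg * F.mse_loss(progress, target_progress)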

Figure: Overview of ProcVLM. VLM-based trajectory annotation produces ProcCorpus-60M and ProcVQA, enabling procedure-aware training for action segmentation, future planning, and progress-based reward estimation.

Cases

The following examples visualize trajectory-internal progress estimates across zero-shot and one-shot manipulation tasks.

Zero-shot cases

One-shot cases

Zero-shot reward editing

Instruction: put the apple into the basket.

Instruction: put the apple into the basket and move the basket to the upper corner.

Results

We evaluate ProcVLM as a progress-based reward model for reward-guided fine-tuning. Its dense progress scores are used by Evo-RL to estimate advantages for advantage-conditioned policy training. The following real-robot examples from a JAKA stack-bowls task illustrate representative policy behaviors under different training settings.
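As a hedged illustration of how dense progress scores can drive advantage estimation, the sketch below treats per-step progress deltas as dense rewards and subtracts a simple batch baseline; this is our simplification, not Evo-RL's actual estimator:

import numpy as np

def progress_advantages(progress, baseline=None):
    """Sketch: per-step rewards as progress deltas, advantages relative to
    a baseline. Illustrative only; Evo-RL's estimator may differ.

    progress: array of ProcVLM progress scores along one trajectory.
    """
    rewards = np.diff(progress, prepend=progress[0])  # dense progress deltas
    if baseline is None:
        baseline = rewards.mean()
    return rewards - baseline

# Advantage-conditioned training then keeps (or upweights) positive-advantage
# samples, downweighting steps that add little task progress.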

Without Advantage Conditioning

Case 1: the policy overfits to local regrasping actions; it may retry even after a successful grasp and become trapped in local action ambiguities.

Case 2: the policy gets stuck after grasping the bowl, oscillating between conflicting motion patterns instead of progressing toward placement.

With Advantage Conditioning

Cases 3 & 4: by selecting positive-advantage samples, advantage conditioning downweights local action patterns that provide little task progress, enabling more effective learning from noisy human-teleoperated data.

Citation

@misc{feng2026procvlmlearningproceduregroundedprogress,
  title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
  author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
  year={2026},
  eprint={2605.08774},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2605.08774},
}