The Problem with Local Advantage Estimation

Critic-free methods for reinforcement learning with verifiable rewards (RLVR), such as Group Relative Policy Optimization (GRPO), are popular for aligning LLMs because they avoid the memory and compute overhead of training a separate value function (critic). However, these methods rely on prompt-local reward statistics: advantages are computed solely from the rollouts within a single prompt group. This approach fails in 'cold-start' regimes where all rollouts in a group receive identical rewards (e.g., all fail or all succeed). In these cases, the within-group reward variance drops to zero, so group normalization yields a zero advantage for every rollout and the prompt contributes no gradient signal, effectively stalling learning.
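The collapse is easy to reproduce. The following is a minimal sketch of group-normalized advantages, not GRPO's reference implementation; the function name, the use of the population standard deviation, and the eps stabilizer are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt group (GRPO-style sketch).

    eps is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group's rollouts
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields informative, signed advantages.
mixed = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails yields all-zero advantages,
# so this prompt contributes nothing to the policy update.
all_fail = group_normalized_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_fail)
```

The same thing happens when every rollout succeeds: each reward equals the group mean, so every numerator is zero regardless of eps.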

The BV-Blend Solution

BV-Blend addresses this failure mode by augmenting prompt-local statistics with historical context. Instead of relying exclusively on the current prompt group, the framework maintains reward moments for each semantic cluster, tracked with an exponential moving average (EMA).
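One way to picture the historical statistics is a small tracker that keeps an EMA of the first and second reward moments per cluster. This is a sketch under stated assumptions: the class name, the decay value, the string cluster ids, and the choice to initialize from the first batch are all hypothetical and are not taken from the method's description.

```python
class ClusterMoments:
    """EMA of reward moments keyed by a (hypothetical) semantic cluster id."""

    def __init__(self, decay=0.9):
        self.decay = decay  # assumed smoothing factor
        self.m1 = {}        # cluster id -> EMA of reward
        self.m2 = {}        # cluster id -> EMA of squared reward

    def update(self, cluster_id, rewards):
        batch_m1 = sum(rewards) / len(rewards)
        batch_m2 = sum(r * r for r in rewards) / len(rewards)
        if cluster_id not in self.m1:
            # First observation for this cluster: start from the batch moments.
            self.m1[cluster_id] = batch_m1
            self.m2[cluster_id] = batch_m2
        else:
            d = self.decay
            self.m1[cluster_id] = d * self.m1[cluster_id] + (1 - d) * batch_m1
            self.m2[cluster_id] = d * self.m2[cluster_id] + (1 - d) * batch_m2

    def mean_std(self, cluster_id):
        mu = self.m1[cluster_id]
        var = max(self.m2[cluster_id] - mu * mu, 0.0)  # clamp rounding error
        return mu, var ** 0.5


tracker = ClusterMoments(decay=0.5)
tracker.update("algebra", [1.0, 0.0, 0.0, 1.0])  # mixed group
tracker.update("algebra", [0.0, 0.0, 0.0, 0.0])  # all-fail group
hist_mean, hist_std = tracker.mean_std("algebra")
print(hist_mean, hist_std)
```

Even after an all-fail group, the tracked standard deviation for the cluster stays above zero, which is what makes it usable as a reference when the local group is uniform.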

Key components of the approach include:

  • Semantic-Cluster-Conditioned Moments: The framework tracks reward statistics across historical data, grouped by semantic similarity, providing a stable reference point when local variance is insufficient.
  • Confidence-Weighted Blending: The system calculates a confidence weight using a Standard Error of the Mean (SEM) proxy. This weight sets how much the final estimate relies on the prompt-local statistics versus the historical moments.
  • Standardized Advantage: By blending these sources, the method produces a more robust advantage estimate for PPO-style clipped updates, so that learning continues even when local reward signals are uniform or noisy.
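The components above can be sketched together as follows. The exact SEM proxy and weighting rule are not specified here, so everything in this block is an assumption chosen for illustration: the SEM proxy uses the historical standard deviation divided by the square root of the group size, the weight is 1 / (1 + sem / tau) with a hypothetical scale tau, and means and standard deviations are blended linearly.

```python
from statistics import mean, pstdev


def blended_advantages(rewards, hist_mean, hist_std, tau=0.25, eps=1e-6):
    """Blend prompt-local and historical statistics (illustrative sketch).

    The weighting rule and tau are assumptions, not the published formula.
    """
    n = len(rewards)
    local_mean = mean(rewards)
    local_std = pstdev(rewards)

    # Assumed SEM proxy: uncertainty of the local mean under historical spread.
    sem = hist_std / (n ** 0.5)
    # Assumed confidence in local statistics; approaches 1 as sem shrinks.
    w = 1.0 / (1.0 + sem / tau)

    mu = w * local_mean + (1.0 - w) * hist_mean
    sigma = w * local_std + (1.0 - w) * hist_std
    return [(r - mu) / (sigma + eps) for r in rewards]


# An all-fail group in a cluster whose historical pass rate is 25%.
adv = blended_advantages([0.0, 0.0, 0.0, 0.0], hist_mean=0.25, hist_std=0.5)
print(adv)
```

With purely local normalization this group would receive all-zero advantages. Here the historical baseline pulls the reference mean above zero, so the failed rollouts receive a negative advantage and the prompt still contributes to the update.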

Impact on Training

By incorporating historical baselines, BV-Blend prevents the training stalls common in binary verifier environments. Empirical results on verifiable reasoning benchmarks demonstrate that the method improves both training stability and overall performance, making it a more robust alternative to standard group-normalized methods in scenarios where reward signals are sparse or unreliable.