PipelineRL uses vLLM as the inference engine for rollout generation. The inference engine samples tokens and returns their logprobs, and the trainer uses these logprobs to compute the policy ratio, KL, clip rate, entropy, and reward. Inconsistencies in how these logprobs are calculated can change the training dynamics: this is a train-inference mismatch that had to be eliminated during the transition from vLLM V0 to V1.
TL;DR. vLLM V1 matched the vLLM V0 reference after fixing four things: processed rollout logprobs, V1-specific runtime defaults, the in-flight weight-update path, and an fp32 lm_head for the final projection. We fixed backend behavior before reaching for objective-side corrections.
Reference runs used vLLM 0.8.5; V1 runs used vLLM 0.18.1. Figure 1 shows the final result: the red run is the first V1 run, and the green run is the final V1 run after the modifications described below.

Purpose of migration
vLLM V1 is a major rewrite of the V0 engine, so the migration goals were intentionally narrow:
- Verify that V1 returns rollout logprobs in the format the trainer expects
- Rerun the same workload against the V0 reference
- Evaluate objective-level changes only after backend parity is restored
The first visible symptoms were in:
- clamp_log_ratio_new_old_indicator
- kl_new_old
- entropy
- reward
These metrics come from a GSPO training run, GSPO being the objective used in this experiment. The same class of mismatch can also surface in other online RL systems (PPO, GRPO, and similar) that treat rollout-side logprobs as part of the optimization target.
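To make the dependency concrete, here is a minimal PyTorch-style sketch (not PipelineRL code; tensor names and the clipping threshold are illustrative) of how rollout logprobs feed the ratio, KL, and clip-rate metrics listed above:

```python
import torch

def ratio_diagnostics(trainer_logprobs: torch.Tensor,
                      rollout_logprobs: torch.Tensor,
                      clip_eps: float = 0.2):
    # log pi_trainer(token) - log pi_rollout(token), per sampled token
    log_ratio = trainer_logprobs - rollout_logprobs
    ratio = log_ratio.exp()                                  # policy ratio
    # simple KL estimator between the two policies at the sampled tokens
    kl_new_old = (ratio - 1.0 - log_ratio).mean()
    # fraction of tokens whose ratio falls outside the clip window
    clip_rate = ((ratio < 1 - clip_eps) | (ratio > 1 + clip_eps)).float().mean()
    return ratio.mean(), kl_new_old, clip_rate
```

Any systematic offset in the rollout logprobs shifts the ratio away from 1.0 and inflates the KL and clip-rate metrics even when the trainer itself is behaving correctly.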
The first V1 run showed the problem clearly: trainer-side logprobs and rewards drifted away from the V0 reference early in training.

The same pattern appears in trainer metrics. Clip rate is the easiest signal to read for the first comparison.

Failure modes
We have divided the possible causes into three tiers:
- Semantics mismatch: the backend returns logprobs with a different meaning than the trainer expects.
- Inference-path mismatch: the same prompt takes a different execution path because the backends use different runtime defaults for caching, scheduling, or request processing.
- Objective mismatch: the RL objective needs a correction for whatever staleness or backend mismatch remains.
Reaching for the third category first would have been premature. Treating the first two as backend-behavior issues and ruling them out first gave a useful diagnosis.
V1 backend fixes
Logprob semantics
The first problem was semantics. By default, vLLM V1 returns logprobs computed from the raw model output, before any sampler post-processing such as temperature scaling, penalties, or top-k/top-p filtering. PipelineRL expects logprobs from the processed distribution the sampler actually draws from.
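To illustrate the distinction (a toy example, not vLLM internals; only temperature scaling is shown), the same sampled token gets a different logprob depending on whether the sampler's post-processing is applied before the log-softmax:

```python
import torch
import torch.nn.functional as F

def raw_vs_processed_logprob(logits: torch.Tensor, token_id: int,
                             temperature: float) -> tuple[float, float]:
    # raw: logprob taken from the unmodified model output
    raw = F.log_softmax(logits, dim=-1)[token_id].item()
    # processed: logprob of the distribution the sampler actually drew from
    # (only temperature shown; penalties and top-k/top-p would also apply)
    processed = F.log_softmax(logits / temperature, dim=-1)[token_id].item()
    return raw, processed
```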
The required setting is:
logprobs-mode=processed_logprobs
This removed the apparent average offset in the rollout logprobs. The training curves still showed a gap against the known-good reference, so the next problem had to be in the inference path.
The policy ratio plot shows this directly. With processed_logprobs enabled in V1, the mean policy ratio stays centered very close to 1.0 for all three runs, confirming that the mean bias is gone. The remaining discrepancies show up in clip rate, KL, entropy, and downstream training behavior.

Runtime defaults
Early V1 runs mixed engine-version differences with V1-specific runtime defaults:
- Prefix caching: left unset in the initial runs; vLLM 0.18.1 enables it by default.
- Asynchronous scheduling: left unset in the initial runs; vLLM 0.18.1 applies it by default.
- disable-cascade-attn: an ad-hoc override set through kwargs passthrough at boot time, outside the committed configuration's parity recipe.
For parity runs, we explicitly made the following choices:
vllm_config:
  use_v1: true
  vllm_kwargs:
    logprobs_mode: processed_logprobs
    enable_prefix_caching: false
    async_scheduling: false
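As a rough sketch of what this configuration maps to (assuming the vllm_kwargs are forwarded one-to-one to the vLLM engine constructor; exact argument names can differ between vLLM versions):

```python
from vllm import LLM

# Hedged sketch, not PipelineRL's launch code; the model name is a placeholder.
llm = LLM(
    model="my-model",
    logprobs_mode="processed_logprobs",  # sampler-processed logprobs
    enable_prefix_caching=False,         # no KV reuse across weight updates
    async_scheduling=False,              # match the V0 reference behavior
)
```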
Prefix caching deserves a separate explanation. For a fixed model state, it is normally an exact inference optimization. In this online RL setup, the concern was that V1 differs from the V0 reference path in cache lifetime and reuse, while the engine also has to handle repeated prefixes, concurrent requests, asynchronous scheduling, and in-flight weight updates.
If the cache policy ignores weight update boundaries, prefix cache hits can reuse the state computed before the weight update. Disabling the prefix cache removed one V1-only degree of freedom from the parity comparison.
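A toy illustration of that degree of freedom (not vLLM internals: the cache here is keyed purely by token ids, so a hit silently ignores the weight version):

```python
# Toy cache keyed only by prompt token ids; weight_version never enters the key.
kv_cache: dict[tuple, str] = {}

def fake_prefill(prompt_ids: tuple, weight_version: int) -> str:
    return f"kv(weights=v{weight_version})"   # stand-in for real KV state

def cached_prefill(prompt_ids: tuple, weight_version: int) -> str:
    if prompt_ids in kv_cache:                # cache hit: version is not checked
        return kv_cache[prompt_ids]
    kv_cache[prompt_ids] = fake_prefill(prompt_ids, weight_version)
    return kv_cache[prompt_ids]

print(cached_prefill((1, 2, 3), weight_version=1))  # kv(weights=v1)
print(cached_prefill((1, 2, 3), weight_version=2))  # still kv(weights=v1): stale reuse
```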
In-flight weight updates
Weight synchronization also had to match the online RL update model. One option was to make V1 stricter than V0 by draining requests and clearing the cache on every update. That would answer a different question; first, we needed V1 to match the existing V0 behavior.
What V0 effectively did was more like:
Block execution at the engine boundary, load the new weights, and resume without any explicit invalidation of cached state.
The closest V1 analogs are:
await engine.pause_generation(mode="keep", clear_cache=False)
await engine_client.collective_rpc_async(
    "receive_weight_update", args=(request.model_dump_json(),),
)
await engine.resume_generation()
Two details are important.
- mode="keep" matches the old in-flight update model more closely than wait or abort.
- clear_cache=False matches the V0 wrapper, which leaves cached state intact across updates.
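For context, one way these calls might be composed into a single update handler (a sketch under our own naming; the handler and request type are not PipelineRL's API):

```python
async def apply_inflight_update(engine, engine_client, request) -> None:
    # Pause scheduling but keep queued work and cached state, like the V0 wrapper.
    await engine.pause_generation(mode="keep", clear_cache=False)
    try:
        # Push the new weights to all workers.
        await engine_client.collective_rpc_async(
            "receive_weight_update", args=(request.model_dump_json(),),
        )
    finally:
        # Resume generation on the updated weights.
        await engine.resume_generation()
```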
Lag was a useful runtime diagnostic. The initial V1 run shows a more persistent lag through training than the modified V1 run.

Remaining gap: fp32 lm_head
The V1 backend fixes above resolved the obvious migration issues, but final parity still required matching the numerical path used to compute the logits. The trainer uses an fp32 lm_head for the final projection, and the rollout backend had to match that behavior.
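A minimal sketch of what matching that behavior means (illustrative only; the function name is ours, and the real integration happens inside the rollout backend's model head):

```python
import torch
import torch.nn.functional as F

def fp32_head_logits(hidden: torch.Tensor, lm_head_weight: torch.Tensor) -> torch.Tensor:
    # Run the final projection in fp32 even if hidden states and weights are
    # bf16/fp16, so rollout logprobs match a trainer that uses an fp32 head.
    return F.linear(hidden.float(), lm_head_weight.float())
```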
A closely related issue is described in the MiniMax-M1 technical report: an RL run showed a training/inference token-probability mismatch that was traced to the LM output head and corrected by computing the head in fp32.
This matters because RL updates consume token logprobs directly. Small changes in the logits become visible in the policy ratio, KL, and clipping, so the precision of the final projection becomes part of the accuracy surface of online RL. The ScaleRL paper likewise includes fp32 logit/head computation in its RL recipe as a useful design choice for large-scale RL.
With the fp32 lm_head path included, reward shows the final parity result compactly. In Figure 6, the final V1 run tracks the V0 reference, while the first V1 attempt produces a distinctly different reward curve.

Ablation
Negative results are important because they rule out general explanations.
- processed_logprobs alone: fixed the logprob semantics, but the training discrepancies remained.
- Batch invariance: separate tests still showed mismatches, high latency, high clip rates, and NCCL complications.
- Treating the first V1 run as a fair baseline: that run had multiple V1-only defaults enabled, which confounded the migration comparison.
Why we fixed backend correctness first
Objective-side modifications such as truncated importance sampling, importance-ratio reweighting, and related techniques are useful tools. When rollouts are intentionally stale, generated asynchronously, or produced by a backend without an exact trainer-side policy equivalent, adding some form of correction is often appropriate.
The first issue here was inference correctness. After migrating to V1, the rollout backend returned logprobs and runtime behavior that violated the trainer's expectations. Adding objective-side corrections at that point would have mixed two questions:
- Is the inference backend producing the correct logprobs?
- Given correct logprobs, does the objective still need off-policy or asynchronous corrections?
Those questions need to be separated. Otherwise, corrections on the objective side may compensate for broken inference backend behavior, making the training curve difficult to interpret.
The current objective can still be improved. With inference parity restored, the next improvements are the usual async/off-policy cleanup (a minimal sketch follows the list below):
- Retain the explicit behavior-policy logprobs from the rollout
- Recompute the old-policy logprobs on the trainer side at optimization time
- Keep compensation for backend discrepancies separate from compensation for policy-update lag
- Track diagnostics such as ESS for the correction terms alongside aggregate trainer metrics
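A minimal sketch of the last two items (names and the truncation threshold are ours, not PipelineRL's): correct for the gap between the behavior policy that generated the rollout and the trainer-side old policy, and report ESS as a health diagnostic for that correction.

```python
import torch

def truncated_is_weights(trainer_old_logprobs: torch.Tensor,
                         behavior_logprobs: torch.Tensor,
                         clip_c: float = 2.0):
    log_w = trainer_old_logprobs - behavior_logprobs
    w = log_w.exp().clamp(max=clip_c)                 # truncated importance weights
    ess = w.sum() ** 2 / (w.pow(2).sum() + 1e-8)      # effective sample size
    return w, ess / w.numel()                         # weights and ESS fraction
```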
The main lesson from this migration is narrower: first fix backend correctness, then add corrections for whatever mismatch remains.

