PipelineRL uses vLLM as the inference engine for rollout generation. The inference engine samples tokens and returns their logprobs, and the trainer uses these logprobs to compute the policy ratio, KL, clip rate, entropy, and reward. Inconsistencies in how these logprobs are calculated can change the training dynamics: this is a train-inference mismatch that had to be eliminated during the transition from vLLM V0 to V1.
TL;DR. vLLM V1 matched the vLLM V0 reference after fixing four things: processed rollout logprobs, V1-specific runtime defaults, the in-flight weight-update path, and an fp32 lm_head for the final projection. We fixed backend behavior before reaching for objective-side corrections.
Reference runs used vLLM 0.8.5; V1 runs used vLLM 0.18.1. Figure 1 shows the final result: the red run is the first V1 run, and the green run is the final V1 run after the modifications described below.

Purpose of migration
vLLM V1 is a major rewrite of the V0 engine, so the migration goals were intentionally narrow:
- Verify that V1 returns rollout logprobs in the format the trainer expects
- Rerun the same workload against the V0 reference
- Evaluate objective-level changes only after backend parity is restored
The first visible symptoms were in:
- clamp_log_ratio_new_old_indicator
- kl_new_old
- entropy
- reward
These metrics come from a GSPO training run, GSPO being the objective used in this experiment. The same class of mismatch can also surface in other online RL systems (PPO, GRPO, and similar) that treat rollout-side logprobs as part of the optimization target.
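To make the dependency concrete, here is a minimal PyTorch-style sketch (not PipelineRL code; tensor names and the clipping threshold are illustrative) of how rollout logprobs feed the ratio, KL, and clip-rate metrics listed above:

```python
import torch

def ratio_diagnostics(trainer_logprobs: torch.Tensor,
                      rollout_logprobs: torch.Tensor,
                      clip_eps: float = 0.2):
    # log pi_trainer(token) - log pi_rollout(token), per sampled token
    log_ratio = trainer_logprobs - rollout_logprobs
    ratio = log_ratio.exp()                                  # policy ratio
    # simple KL estimator between the two policies at the sampled tokens
    kl_new_old = (ratio - 1.0 - log_ratio).mean()
    # fraction of tokens whose ratio falls outside the clip window
    clip_rate = ((ratio < 1 - clip_eps) | (ratio > 1 + clip_eps)).float().mean()
    return ratio.mean(), kl_new_old, clip_rate
```

Any systematic offset in the rollout logprobs shifts the ratio away from 1.0 and inflates the KL and clip-rate metrics even when the trainer itself is behaving correctly.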
The first V1 run showed the problem clearly: trainer-side logprobs and rewards drifted away from the V0 reference early in training.

The same pattern appears in trainer metrics. Clip rate is the easiest signal to read for the first comparison.

Failure modes
We have divided the possible causes into three tiers:
- Semantics mismatch: the backend returns logprobs with a different meaning than the trainer expects.
- Inference-path mismatch: the same prompt takes a different execution path because the backends use different runtime defaults for caching, scheduling, or request processing.
- Objective mismatch: the RL objective needs a correction for whatever staleness or backend mismatch remains.
Reaching for the third category first would have been premature. Treating the first two as backend-behavior issues and ruling them out first gave a useful diagnosis.
V1 backend fixes
Logprob semantics
The first problem was semantics. By default, vLLM V1 returns logprobs computed from the raw model output, before any sampler post-processing such as temperature scaling, penalties, or top-k/top-p filtering. PipelineRL expects logprobs from the processed distribution the sampler actually draws from.
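To illustrate the distinction (a toy example, not vLLM internals; only temperature scaling is shown), the same sampled token gets a different logprob depending on whether the sampler's post-processing is applied before the log-softmax:

```python
import torch
import torch.nn.functional as F

def raw_vs_processed_logprob(logits: torch.Tensor, token_id: int,
                             temperature: float) -> tuple[float, float]:
    # raw: logprob taken from the unmodified model output
    raw = F.log_softmax(logits, dim=-1)[token_id].item()
    # processed: logprob of the distribution the sampler actually drew from
    # (only temperature shown; penalties and top-k/top-p would also apply)
    processed = F.log_softmax(logits / temperature, dim=-1)[token_id].item()
    return raw, processed
```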
The required setting is:
logprobs-mode=processed_logprobs
This removed the apparent average offset in the rollout logprobs. The training curves still showed a gap against the known-good reference, so the next problem had to be in the inference path.
The policy ratio plot shows this directly. With processed_logprobs enabled in V1, the mean policy ratio stays centered very close to 1.0 for all three runs, confirming that the mean bias is gone. The remaining discrepancies show up in clip rate, KL, entropy, and downstream training behavior.

Runtime defaults
Early V1 runs mixed engine-version differences with V1-specific runtime defaults:
- Prefix caching: left unset in the initial runs; vLLM 0.18.1 enables it by default.
- Asynchronous scheduling: left unset in the initial runs; vLLM 0.18.1 applies it by default.
- disable-cascade-attn: an ad-hoc override set through kwargs passthrough at boot time, outside the committed configuration's parity recipe.
For parity runs, we explicitly made the following choices:
vllm_config:
  use_v1: true
  vllm_kwargs:
    logprobs_mode: processed_logprobs
    enable_prefix_caching: false
    async_scheduling: false
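As a rough sketch of what this configuration maps to (assuming the vllm_kwargs are forwarded one-to-one to the vLLM engine constructor; exact argument names can differ between vLLM versions):

```python
from vllm import LLM

# Hedged sketch, not PipelineRL's launch code; the model name is a placeholder.
llm = LLM(
    model="my-model",
    logprobs_mode="processed_logprobs",  # sampler-processed logprobs
    enable_prefix_caching=False,         # no KV reuse across weight updates
    async_scheduling=False,              # match the V0 reference behavior
)
```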
Prefix caching deserves a separate explanation. For a fixed model state, it is normally an exact inference optimization. In this online RL setup, the concern was that V1 differs from the V0 reference path in cache lifetime and reuse, while the engine also has to handle repeated prefixes, concurrent requests, asynchronous scheduling, and in-flight weight updates.
If the cache policy ignores weight update boundaries, prefix cache hits can reuse the state computed before the weight update. Disabling the prefix cache removed one V1-only degree of freedom from the parity comparison.
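A toy illustration of that degree of freedom (not vLLM internals: the cache here is keyed purely by token ids, so a hit silently ignores the weight version):

```python
# Toy cache keyed only by prompt token ids; weight_version never enters the key.
kv_cache: dict[tuple, str] = {}

def fake_prefill(prompt_ids: tuple, weight_version: int) -> str:
    return f"kv(weights=v{weight_version})"   # stand-in for real KV state

def cached_prefill(prompt_ids: tuple, weight_version: int) -> str:
    if prompt_ids in kv_cache:                # cache hit: version is not checked
        return kv_cache[prompt_ids]
    kv_cache[prompt_ids] = fake_prefill(prompt_ids, weight_version)
    return kv_cache[prompt_ids]

print(cached_prefill((1, 2, 3), weight_version=1))  # kv(weights=v1)
print(cached_prefill((1, 2, 3), weight_version=2))  # still kv(weights=v1): stale reuse
```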
In-flight weight updates
Weight synchronization also had to match the online RL update model. One option was to make V1 stricter than V0 by draining requests and clearing the cache on every update. That would answer a different question; first, we needed V1 to match the existing V0 behavior.
What V0 effectively did was more like:
Block execution at the engine boundary, load the new weights, and resume without any explicit invalidation of cached state.
The closest V1 analogs are:
await engine.pause_generation(mode="keep", clear_cache=False)
await engine_client.collective_rpc_async(
    "receive_weight_update", args=(request.model_dump_json(),),
)
await engine.resume_generation()
Two details are important.
- mode="keep" matches the old in-flight update model more closely than wait or abort.
- clear_cache=False matches the V0 wrapper, which leaves cached state intact across updates.
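For context, one way these calls might be composed into a single update handler (a sketch under our own naming; the handler and request type are not PipelineRL's API):

```python
async def apply_inflight_update(engine, engine_client, request) -> None:
    # Pause scheduling but keep queued work and cached state, like the V0 wrapper.
    await engine.pause_generation(mode="keep", clear_cache=False)
    try:
        # Push the new weights to all workers.
        await engine_client.collective_rpc_async(
            "receive_weight_update", args=(request.model_dump_json(),),
        )
    finally:
        # Resume generation on the updated weights.
        await engine.resume_generation()
```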
Lag was a useful runtime diagnostic. The initial V1 run shows a more persistent lag through training than the modified V1 run.

Remaining gap: fp32 lm_head
The V1 backend fixes above resolved the obvious migration issues, but final parity still required matching the numerical path used to compute the logits. The trainer uses an fp32 lm_head for the final projection, and the rollout backend had to match that behavior.
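A minimal sketch of what matching that behavior means (illustrative only; the function name is ours, and the real integration happens inside the rollout backend's model head):

```python
import torch
import torch.nn.functional as F

def fp32_head_logits(hidden: torch.Tensor, lm_head_weight: torch.Tensor) -> torch.Tensor:
    # Run the final projection in fp32 even if hidden states and weights are
    # bf16/fp16, so rollout logprobs match a trainer that uses an fp32 head.
    return F.linear(hidden.float(), lm_head_weight.float())
```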
A closely related issue is described in the MiniMax-M1 technical report: an RL run showed a training/inference token-probability mismatch that was traced to the LM output head and corrected by computing the head in fp32.
This matters because RL updates consume token logprobs directly. Small changes in the logits become visible in the policy ratio, KL, and clipping, so the precision of the final projection becomes part of the accuracy surface of online RL. The ScaleRL paper likewise includes fp32 logit/head computation in its RL recipe as a useful design choice for large-scale RL.
With the fp32 lm_head path included, reward shows the final parity result compactly. In Figure 6, the final V1 run tracks the V0 reference, while the first V1 attempt produces a distinctly different reward curve.

Ablation
Negative results are important because they rule out general explanations.
- processed_logprobs alone: fixed the logprob semantics, but the training discrepancies remained.
- Batch invariance: separate tests still showed mismatches, high latency, high clip rates, and NCCL complications.
- Treating the first V1 run as a fair baseline: that run had multiple V1-only defaults enabled, which confounded the migration comparison.
Why we fixed backend correctness first
Objective-side modifications such as truncated importance sampling, importance-ratio reweighting, and related techniques are useful tools. When rollouts are intentionally stale, generated asynchronously, or produced by a backend without an exact trainer-side policy equivalent, adding some form of correction is often appropriate.
The first issue here was inference correctness. After migrating to V1, the rollout backend returned logprobs and runtime behavior that violated the trainer's expectations. Adding objective-side corrections at that point would have mixed two questions:
- Is the inference backend producing the correct logprobs?
- Given correct logprobs, does the objective still need off-policy or asynchronous corrections?
Those questions need to be separated. Otherwise, corrections on the objective side may compensate for broken inference backend behavior, making the training curve difficult to interpret.
The current objective can still be improved. With inference parity restored, the next improvements are the usual async/off-policy cleanup (a minimal sketch follows the list below):
- Retain the explicit behavior-policy logprobs from the rollout
- Recompute the old-policy logprobs on the trainer side at optimization time
- Keep compensation for backend discrepancies separate from compensation for policy-update lag
- Track diagnostics such as ESS for the correction terms alongside aggregate trainer metrics
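A minimal sketch of the last two items (names and the truncation threshold are ours, not PipelineRL's): correct for the gap between the behavior policy that generated the rollout and the trainer-side old policy, and report ESS as a health diagnostic for that correction.

```python
import torch

def truncated_is_weights(trainer_old_logprobs: torch.Tensor,
                         behavior_logprobs: torch.Tensor,
                         clip_c: float = 2.0):
    log_w = trainer_old_logprobs - behavior_logprobs
    w = log_w.exp().clamp(max=clip_c)                 # truncated importance weights
    ess = w.sum() ** 2 / (w.pow(2).sum() + 1e-8)      # effective sample size
    return w, ess / w.numel()                         # weights and ESS fraction
```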
The main lesson from this migration is narrower: first fix backend correctness, then add corrections for whatever mismatch remains.

