We study when RLVR generalizes under three weak supervision settings (scarce data with as few as 8 examples, noisy reward labels, and proxy rewards such as majority vote and self-certainty) across multiple models from the Qwen and Llama families on three reasoning domains: Math, Science, and Graph.
Finding: Generalization is governed by saturation dynamics. Models progress through a pre-saturation phase, where training reward steadily increases and the model learns transferable reasoning, followed by a post-saturation phase, where reward plateaus and further training yields diminishing returns. Models with extended pre-saturation phases (Qwen on Math and Science) generalize from as few as 8 examples, tolerate significant label noise, and even learn from proxy rewards. Rapidly saturating models (Llama across all domains) fail in all three settings.
Root cause: Failure stems from unfaithful reasoning, not lack of diversity. Failing models maximize training reward by memorizing answers while producing reasoning traces that do not logically support their final answers, despite maintaining high output diversity.
We study three settings where supervision is imperfect: scarce data (as few as 8 examples), noisy reward labels, and self-supervised proxy rewards. The findings below span multiple models from the Qwen and Llama families across Math, Science, and Graph reasoning domains.
How does data scarcity affect RLVR generalization? We train with as few as 8 examples across different models and domains, tracking saturation dynamics, in particular the point at which training reward plateaus and learning effectively stops.
The same pattern emerges across all three domains: models with extended pre-saturation phases generalize from as few as 8 samples, while rapidly saturating models require substantially more data. This is domain-dependent — even Qwen-Math saturates faster on Graph, where pretraining exposure is low.
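The saturation point can be located with a simple plateau test on the training-reward curve. A minimal sketch of one way to do this (the window size and tolerance below are illustrative choices of ours, not values from the paper):

```python
def saturation_step(rewards, window=50, tol=0.01):
    """Return the first step at which the trailing-window mean of the
    training reward stops improving by more than `tol`, i.e. the start
    of the post-saturation phase; None if training never saturates."""
    for t in range(2 * window, len(rewards)):
        prev = sum(rewards[t - 2 * window : t - window]) / window
        curr = sum(rewards[t - window : t]) / window
        if curr - prev < tol:
            return t
    return None
```

Under this framing, "extended pre-saturation" simply means `saturation_step` is large (or `None`) relative to the training budget, while rapidly saturating models trigger the plateau test early.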
When ground-truth verifiers are imperfect, reward labels may contain errors. We corrupt a fraction γ of training labels and measure how robustly each model-domain pair generalizes.
Llama-3B-Instruct on Math shows progressive degradation with increasing label corruption — MATH-500 drops from ~51% at γ = 0.1 to ~42% at γ = 0.9. Models that saturate faster are generally less robust to noise: Llama memorizes incorrect answers just as easily as correct ones. Qwen-Math-7B on Graph tolerates low corruption but degrades at γ ≥ 0.5.
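Label corruption of this kind can be simulated by flipping the binary verifier reward on a random fraction γ of training examples. A minimal sketch (function and parameter names are ours, assuming a 0/1 verified-correct label per example):

```python
import random

def corrupt_labels(labels, gamma, seed=0):
    """Flip the binary reward label (1 = verified correct, 0 = incorrect)
    for a uniformly random fraction `gamma` of training examples."""
    rng = random.Random(seed)
    n_flip = int(gamma * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]
```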
When ground-truth verifiers are entirely unavailable, models must rely on alternative reward signals. We compare RLVR (ground-truth) against two proxy rewards: majority vote (consensus among sampled responses) and self-certainty (model confidence).
Self-supervised proxy rewards are brittle and model-dependent. Qwen-3B with majority vote shows temporary gains before collapsing after ~500 steps. Llama-3B-Instruct reward-hacks majority vote to 1.0 as MATH-500 collapses from 45% to 4%. Self-certainty collapses in both models. Only math-specialized models (Qwen-Math) show stable improvement with proxy rewards (see Figure 22 in the paper).
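The two proxy rewards can be sketched in a few lines. Exact formulations vary across the literature; the versions below are one common choice and are not necessarily the paper's definitions:

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """Reward each of the k sampled answers 1.0 if it agrees with the
    plurality answer across the samples, else 0.0 (no ground truth)."""
    plurality, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == plurality else 0.0 for a in sampled_answers]

def self_certainty_reward(token_logprobs):
    """One simple confidence proxy: the mean per-token log-probability
    of a sampled response (higher means more confident)."""
    return sum(token_logprobs) / len(token_logprobs)
```

Note that majority vote is trivially hackable: a model that collapses to emitting a single answer for every sample drives this reward to 1.0 regardless of correctness, consistent with the Llama-3B-Instruct collapse described above.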
A natural hypothesis: failing models lack output diversity — they can't explore enough. But this is wrong. Llama maintains higher diversity than Qwen, yet performs worse. The real explanation is unfaithful reasoning: Llama produces correct final answers with chain-of-thought traces that do not logically support them.
Low reasoning faithfulness explains why some models fail under weak supervision: they memorize answers rather than learn transferable reasoning, leading to rapid saturation. Raw diversity is misleading — it should always be evaluated jointly with faithfulness.
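Output diversity can be quantified in many ways; one standard proxy is the distinct-n ratio over sampled responses (our illustrative choice, not necessarily the metric used in the paper), which makes concrete what "high raw diversity" measures and why it says nothing about faithfulness:

```python
def distinct_n(responses, n=2):
    """Fraction of unique n-grams across all sampled responses,
    a common (but, per the finding above, insufficient) diversity proxy."""
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams += [tuple(toks[i : i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```

A model can score high on this metric by varying its chain-of-thought wording while the traces still fail to support the final answers, which is exactly the failure mode described above.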
Continual pre-training on domain-specific data combined with supervised fine-tuning on explicit reasoning traces before RL recovers generalization across all three weak supervision settings.
Supervised fine-tuning on explicit reasoning traces before RL improves reasoning faithfulness, extends the pre-saturation phase, and enables generalization under all three weak supervision settings. Continual pre-training further amplifies the effect, achieving the strongest gains across both in-domain and out-of-domain benchmarks. See Figure 7 in the paper for how CPT + Thinking SFT improves faithfulness compared to other configurations.
@article{rahman2026when,
title = {When Can LLMs Learn to Reason with Weak Supervision?},
author = {Rahman, Salman and Shen, Jingyan and Mordvina, Anna and
Palangi, Hamid and Gabriel, Saadia and Izmailov, Pavel},
journal = {Preprint},
year = {2026}
}