We study when RLVR generalizes under three weak supervision settings (scarce data with as few as 8 examples, noisy reward labels, and proxy rewards such as majority vote and self-certainty) across multiple models from the Qwen and Llama families on three reasoning domains: Math, Science, and Graph.
Finding: Generalization is governed by saturation dynamics. Models progress through a pre-saturation phase, where training reward steadily increases and the model learns transferable reasoning, followed by a post-saturation phase, where reward plateaus and further training yields diminishing returns. Models with extended pre-saturation phases (Qwen on Math and Science) generalize from as few as 8 examples, tolerate significant label noise, and even learn from proxy rewards. Rapidly saturating models (Llama across all domains) fail in all three settings.
Root cause: Failure stems from unfaithful reasoning, not lack of diversity. Failing models maximize training reward by memorizing answers while producing reasoning traces that do not logically support their final answers, despite maintaining high output diversity.
We study three settings where supervision is imperfect: scarce data (as few as 8 examples), noisy reward labels, and self-supervised proxy rewards. The findings below span multiple models from the Qwen and Llama families across Math, Science, and Graph reasoning domains.
How does data scarcity affect RLVR generalization? We train with as few as 8 examples across different models and domains, tracking saturation dynamics, in particular the point at which training reward plateaus and learning effectively stops.
The same pattern emerges across all three domains: models with extended pre-saturation phases generalize from as few as 8 samples, while rapidly saturating models require substantially more data. This is domain-dependent — even Qwen-Math saturates faster on Graph, where pretraining exposure is low.
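The saturation point can be located with a simple plateau test on the training-reward curve. A minimal sketch of one way to do this (the window size and tolerance below are illustrative choices of ours, not values from the paper):

```python
def saturation_step(rewards, window=50, tol=0.01):
    """Return the first step at which the trailing-window mean of the
    training reward stops improving by more than `tol`, i.e. the start
    of the post-saturation phase; None if training never saturates."""
    for t in range(2 * window, len(rewards)):
        prev = sum(rewards[t - 2 * window : t - window]) / window
        curr = sum(rewards[t - window : t]) / window
        if curr - prev < tol:
            return t
    return None
```

Under this framing, "extended pre-saturation" simply means `saturation_step` is large (or `None`) relative to the training budget, while rapidly saturating models trigger the plateau test early.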
When ground-truth verifiers are imperfect, reward labels may contain errors. We corrupt a fraction γ of training labels and measure how robustly each model-domain pair generalizes.
Llama-3B-Instruct on Math shows progressive degradation with increasing label corruption — MATH-500 drops from ~51% at γ = 0.1 to ~42% at γ = 0.9. Models that saturate faster are generally less robust to noise: Llama memorizes incorrect answers just as easily as correct ones. Qwen-Math-7B on Graph tolerates low corruption but degrades at γ ≥ 0.5.
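Label corruption of this kind can be simulated by flipping the binary verifier reward on a random fraction γ of training examples. A minimal sketch (function and parameter names are ours, assuming a 0/1 verified-correct label per example):

```python
import random

def corrupt_labels(labels, gamma, seed=0):
    """Flip the binary reward label (1 = verified correct, 0 = incorrect)
    for a uniformly random fraction `gamma` of training examples."""
    rng = random.Random(seed)
    n_flip = int(gamma * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]
```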
When ground-truth verifiers are entirely unavailable, models must rely on alternative reward signals. We compare RLVR (ground-truth) against two proxy rewards: majority vote (consensus among sampled responses) and self-certainty (model confidence).
Self-supervised proxy rewards are brittle and model-dependent. Qwen-3B with majority vote shows temporary gains before collapsing after ~500 steps. Llama-3B-Instruct reward-hacks majority vote to 1.0 as MATH-500 collapses from 45% to 4%. Self-certainty collapses in both models. Only math-specialized models (Qwen-Math) show stable improvement with proxy rewards (see Figure 22 in the paper).
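The two proxy rewards can be sketched in a few lines. Exact formulations vary across the literature; the versions below are one common choice and are not necessarily the paper's definitions:

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """Reward each of the k sampled answers 1.0 if it agrees with the
    plurality answer across the samples, else 0.0 (no ground truth)."""
    plurality, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == plurality else 0.0 for a in sampled_answers]

def self_certainty_reward(token_logprobs):
    """One simple confidence proxy: the mean per-token log-probability
    of a sampled response (higher means more confident)."""
    return sum(token_logprobs) / len(token_logprobs)
```

Note that majority vote is trivially hackable: a model that collapses to emitting a single answer for every sample drives this reward to 1.0 regardless of correctness, consistent with the Llama-3B-Instruct collapse described above.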
A natural hypothesis: failing models lack output diversity — they can't explore enough. But this is wrong. Llama maintains higher diversity than Qwen, yet performs worse. The real explanation is unfaithful reasoning: Llama produces correct final answers with chain-of-thought traces that do not logically support them.
Low reasoning faithfulness explains why some models fail under weak supervision: they memorize answers rather than learn transferable reasoning, leading to rapid saturation. Raw diversity is misleading — it should always be evaluated jointly with faithfulness.
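Output diversity can be quantified in many ways; one standard proxy is the distinct-n ratio over sampled responses (our illustrative choice, not necessarily the metric used in the paper), which makes concrete what "high raw diversity" measures and why it says nothing about faithfulness:

```python
def distinct_n(responses, n=2):
    """Fraction of unique n-grams across all sampled responses,
    a common (but, per the finding above, insufficient) diversity proxy."""
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams += [tuple(toks[i : i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```

A model can score high on this metric by varying its chain-of-thought wording while the traces still fail to support the final answers, which is exactly the failure mode described above.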
Continual pre-training on domain-specific data combined with supervised fine-tuning on explicit reasoning traces before RL recovers generalization across all three weak supervision settings.
Supervised fine-tuning on explicit reasoning traces before RL improves reasoning faithfulness, extends the pre-saturation phase, and enables generalization under all three weak supervision settings. Continual pre-training further amplifies the effect, achieving the strongest gains across both in-domain and out-of-domain benchmarks. See Figure 7 in the paper for how CPT + Thinking SFT improves faithfulness compared to other configurations.
@article{rahman2026when,
title = {When Can LLMs Learn to Reason with Weak Supervision?},
author = {Rahman, Salman and Shen, Jingyan and Mordvina, Anna and
Palangi, Hamid and Gabriel, Saadia and Izmailov, Pavel},
journal = {Preprint},
year = {2026}
}