TL;DR: Identify thought anchors with 91% fewer API calls while still finding 98% of them.
This project provides cost-efficient sampling strategies for detecting "thought anchors" - critical reasoning steps in LLM chain-of-thought (CoT) traces that determine the final answer.
The Thought Anchors paper identifies important reasoning steps by sampling each chunk 100 times. For 150 chunks, that's 15,000 API calls per problem.
We evaluated five strategies on 40 math problems (6,473 chunks with rollouts, 1,612 thought anchors):
| Strategy | Samples | Savings | Recall | Precision | F1 | Recommendation |
|---|---|---|---|---|---|---|
| Paper (Baseline) | 647,300 | — | 100% | 100% | 100% | Ground truth |
| Pure Adaptive | 332,620 | 48.6% | 100% | 99.9% | 99.9% | ✅ When recall matters |
| Pure Sparse | 276,400 | 57.3% | 97.9% | 94.0% | 95.9% | Good baseline |
| Hybrid | 137,680 | 78.7% | 98.0% | 94.1% | 96.0% | Good value |
| Evolved Hybrid | 56,750 | 91.2% | 97.7% | 87.1% | 92.1% | ✅ Best savings |
Key insight: Most chunks are either clearly important or clearly not. You don't need 100 samples to tell—adaptive early stopping alone saves 49%. Combine with sparse sampling (every 3rd chunk + fill-in) for 79% total savings.
The best algorithm (Evolved Hybrid) was discovered using OpenEvolve, an LLM-driven code evolution framework. It found novel optimizations—shorter sampling schedules, aggressive early-exit thresholds, and distance-aware verification—achieving 91% savings while maintaining 98% recall.
Also tested on safety scenarios (Thought Branches paper):
- Blackmail scenarios: 85.5% savings with 92.1% recall
- Whistleblower scenarios: 87.4% savings with 94.8% agreement
These results demonstrate the approach generalizes beyond math problems to AI safety analysis.
Robustness check: We verified results are not sensitive to rollout ordering by running all analyses with randomized rollout order—differences were <1% across all metrics.
Terminology note: This repo measures cost in samples (i.e., "how many rollouts you would have generated"), because it operates on already-generated rollouts. Interpreting samples as "API calls" matches the original paper's resampling procedure.
Thanks to Paul Bogdan for suggesting the sparse sampling approach, and to Paul and Uzay Macar for their work on thought anchors generally.
- Quick Start
- Background: What Are Thought Anchors?
- The Five Sampling Strategies
- Technical Details
- Results
- File Structure
- Usage Examples
```
uv sync
```

The analysis uses pre-generated rollouts from HuggingFace (no new LLM calls are made). The rollout data is large (often multiple GB; the 40-problem subset used here is ~20GB on disk).
If you already have hf_math_rollouts/ locally, you can skip this step. Otherwise, download the same 40-problem evaluation subset with:
```
uv run python download_problems.py --problems 330 1591 2050 2137 2189 2236 2238 2870 3360 3448 3550 3916 3935 4019 4164 4605 4682 6481 6596 6998
```

Note: This requires internet access and downloads from huggingface.co/datasets/uzaymacar/math-rollouts.
This downloads rollouts for 20 problem IDs and saves both:
- `correct_base_solution/` (20 problems)
- `incorrect_base_solution/` (20 problems)
So the standard evaluation set is 40 problems total. Each problem has ~150 chunks; most chunks have 100 pre-generated rollouts in chunk_*/solutions.json.
Additional options:
- `uv run python download_problems.py --list-only` lists what the HF dataset currently contains.
- `uv run python download_problems.py --all` downloads all available problems in the HF dataset and can be very large.
```
# Math problems analysis
uv run python run_analysis.py

# Blackmail scenarios analysis (requires separate download)
uv run python download_blackmail.py
uv run python run_blackmail_analysis.py

# Whistleblower scenarios analysis (requires separate download)
uv run python download_whistleblower.py
uv run python run_whistleblower_analysis.py
```

When an LLM solves a math problem using chain-of-thought reasoning, it generates a long reasoning trace with many steps. Thought anchors are the critical steps that actually matter for the final answer.
To identify thought anchors, the paper uses this approach:
- Split the reasoning trace into chunks (sentences or paragraphs)
- For each chunk:
- Remove it from the trace
- Generate 100 new continuations from that point
- Measure how accuracy changes
- Classify based on accuracy change:
- IMPORTANT: Removing this chunk significantly hurts accuracy
- NEGLIGIBLE: Removing this chunk doesn't affect accuracy
- UNCERTAIN: Can't tell with statistical confidence
For a single problem with 150 chunks, the brute-force method requires:
150 chunks × 100 samples = 15,000 API calls
This project explores how to achieve the same results with fewer API calls.
The baseline approach from the original paper.
| Aspect | Value |
|---|---|
| Chunks sampled | All (100%) |
| Samples per chunk | 100 (fixed) |
| Early stopping | No |
| Interpolation | No |
How it works:
- Sample every chunk exactly 100 times
- Classify based on bootstrap confidence interval
Sample every chunk, but stop early when the classification is statistically confident.
| Aspect | Value |
|---|---|
| Chunks sampled | All (100%) |
| Samples per chunk | 10-100 (adaptive) |
| Early stopping | Yes |
| Interpolation | No |
How it works:
- Start with 10 samples for a chunk
- Compute bootstrap confidence interval (CI) for importance
- If CI decisively classifies the chunk → stop early
- Otherwise, add more samples: 20, 30, 50, 70, then 100
Adaptive Schedule: [10, 20, 30, 50, 70, 100]
At each checkpoint, we stop if:
- IMPORTANT: CI lower bound > 0 (removing chunk hurts accuracy)
- NEGLIGIBLE: CI upper bound < 0 (removing chunk helps or doesn't matter)
Most chunks are clearly important or clearly negligible, so early stopping saves ~49% of samples while maintaining 100% recall.
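The checkpoint logic above can be sketched as a small, self-contained loop. This is an illustrative sketch, not the repo's actual implementation (which lives in `sampling_strategies/adaptive.py`); the helper names are ours, and `bootstrap_ci` mirrors the function shown under Technical Details.

```python
import numpy as np

def bootstrap_ci(effects, confidence=0.95, n_reps=1000, seed=42):
    """Bootstrap CI over the mean effect (effect = baseline - is_correct)."""
    arr = np.array(effects)
    rng = np.random.default_rng(seed)
    means = [np.mean(rng.choice(arr, size=len(arr), replace=True))
             for _ in range(n_reps)]
    alpha = 1 - confidence
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def adaptive_classify(effects, schedule=(10, 20, 30, 50, 70, 100)):
    """Stop at the first checkpoint where the CI is decisive.

    `effects` is the full per-rollout effect list for one chunk;
    returns (classification, samples_used).
    """
    for n in schedule:
        lo, hi = bootstrap_ci(effects[:n])
        if lo > 0:                       # removing the chunk hurts accuracy
            return "IMPORTANT", n
        if hi < 0:                       # removing the chunk helps / no effect
            return "NEGLIGIBLE", n
    return "UNCERTAIN", schedule[-1]     # never reached a decisive CI
```

A chunk whose removal consistently flips the answer exits at the first checkpoint with only 10 samples, which is where the ~49% savings come from.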
Sample only every Nth chunk, fill in gaps when accuracy changes significantly.
| Aspect | Value |
|---|---|
| Chunks sampled | ~33% initially (every 3rd) |
| Samples per chunk | 100 (fixed, no adaptive) |
| Early stopping | No |
| Interpolation | Yes (for remaining gaps) |
How it works:
Pass 1 - Sparse Sampling:
```
Chunks:  0 1 2 3 4 5 6 7 8 9 10 11 ...
Sampled: ✓ - - ✓ - - ✓ - - ✓ -  -  ...
```
Sample chunks 0, 3, 6, 9, ... (every 3rd) with full 100 samples.
Pass 2 - Fill-in: If accuracy jumps significantly between sparse samples, fill in the gap:
```
Chunk 3 accuracy: 0.85
Chunk 6 accuracy: 0.45  ← Jump of 0.40 detected!
Fill in chunks 4 and 5 with 100 samples each
```
Pass 3 - Interpolation: Remaining chunks inherit classification from neighbors (see Interpolation Modes).
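Passes 1 and 2 reduce to choosing which chunk indices to sample in full. A minimal sketch of that selection (the helper name and signature are ours, not the repo's API):

```python
def sparse_plan(n_chunks, step=3, accuracies=None, jump_threshold=0.10):
    """Return the sorted chunk indices to sample with full rollouts.

    Pass 1 picks every `step`-th chunk; pass 2 fills the gap between two
    sparse chunks whose accuracies differ by >= `jump_threshold`.
    `accuracies` maps sampled index -> observed accuracy.
    """
    sampled = set(range(0, n_chunks, step))          # Pass 1: sparse grid
    if accuracies:
        for i in range(0, n_chunks - step, step):    # Pass 2: jump detection
            j = i + step
            if i in accuracies and j in accuracies and \
               abs(accuracies[j] - accuracies[i]) >= jump_threshold:
                sampled.update(range(i + 1, j))      # fill in the gap
    return sorted(sampled)
```

With the accuracies from the example above (0.85 at chunk 3, 0.45 at chunk 6), the 0.40 jump triggers fill-in of chunks 4 and 5; everything else is left to interpolation in pass 3.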
Combines the best of both: skip chunks AND stop early.
| Aspect | Value |
|---|---|
| Chunks sampled | ~33% initially (every 3rd) |
| Samples per chunk | 10-100 (adaptive) |
| Early stopping | Yes |
| Interpolation | Yes (for remaining gaps) |
How it works:
Same three-pass structure as Pure Sparse, but uses adaptive sampling instead of fixed 100 samples:
Pass 1 - Sparse with Adaptive: Sample every 3rd chunk using adaptive early stopping (10→100 samples)
Pass 2 - Fill-in with Adaptive: When accuracy jumps detected, fill in gaps using adaptive sampling
Pass 3 - Interpolation: Remaining chunks inherit from neighbors
An LLM-evolved algorithm that discovered novel optimizations beyond hand-tuned strategies.
| Aspect | Value |
|---|---|
| Chunks sampled | ~20% initially (every 5th) |
| Samples per chunk | 6-80 (aggressive adaptive) |
| Early stopping | Yes (more aggressive) |
| Interpolation | Yes (with smart verification) |
Key optimizations discovered by OpenEvolve:
- Shorter schedule `[6, 12, 24, 48, 80]` - start earlier, exit faster
- Aggressive early exit at `mean_effect > 0.12`
- Lower confidence (0.90 vs 0.95) - tolerates more uncertainty
- Fewer bootstrap reps (400-600 vs 1000) - faster decisions
- Distance-aware verification - only verify interpolated IMPORTANT chunks when neighbors are ambiguous
See openevolve/README.md for how this algorithm was discovered.
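The discovered settings can be summarized as a config sketch. The field names here are illustrative only, not the repo's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvolvedHybridParams:
    """Illustrative summary of the OpenEvolve-discovered settings."""
    sparse_step: int = 5                      # sample every 5th chunk (vs every 3rd)
    schedule: tuple = (6, 12, 24, 48, 80)     # shorter adaptive schedule
    early_exit_effect: float = 0.12           # exit immediately on strong mean effect
    confidence: float = 0.90                  # vs 0.95 in hand-tuned strategies
    bootstrap_reps: int = 500                 # 400-600 vs the usual 1000
    distance_aware_verification: bool = True  # verify interpolated IMPORTANT chunks
                                              # only when neighbors are ambiguous
```

Each field trades a little statistical caution for fewer samples, which is how the strategy reaches ~91% savings while keeping ~98% recall.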
Classification uses bootstrap resampling to compute confidence intervals:
```python
import numpy as np

def bootstrap_ci(values, confidence=0.95, n_reps=1000):
    # values: list[float] of effects, where effect = baseline - is_correct
    arr = np.array(values)
    rng = np.random.default_rng(42)
    means = [np.mean(rng.choice(arr, size=len(arr), replace=True))
             for _ in range(n_reps)]
    alpha = 1 - confidence
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```

Classification logic:

- If `ci_lower > 0`: IMPORTANT (removing chunk hurts accuracy)
- If `ci_upper < 0`: NEGLIGIBLE (removing chunk doesn't hurt)
- Otherwise: UNCERTAIN
Fill-in occurs when accuracy changes significantly between sparse samples:
```python
jump_threshold = 0.10  # 10% accuracy change

if abs(accuracy[chunk_i + 3] - accuracy[chunk_i]) >= jump_threshold:
    # Fill in chunks i+1 and i+2
    ...
```

Threshold choices:

- `0.001` (essentially 0) = fill ALL gaps → maximum accuracy, moderate savings
- `0.05` = fill on 5% jumps → good balance
- `0.10` = fill on 10% jumps → more aggressive savings
When a chunk isn't sampled, it must be classified by interpolation:
Conservative (recommended):

```python
# If either neighbor is IMPORTANT, classify as IMPORTANT
if prev.classification == "IMPORTANT" or next.classification == "IMPORTANT":
    return "IMPORTANT"
# If neighbors agree, use that classification
elif prev.classification == next.classification:
    return prev.classification
else:
    return "UNCERTAIN"
```

Nearest:

```python
# Use classification from the closer neighbor
if distance_to_prev <= distance_to_next:
    return prev.classification
else:
    return next.classification
```

From the paper, importance measures how much removing a chunk affects accuracy:
```
importance(chunk) = P(correct | with chunk) - P(correct | without chunk)
                  = baseline_accuracy - accuracy_when_chunk_removed
```
Where:

- `baseline_accuracy` = 1.0 if the base solution is correct, 0.0 if incorrect
- `accuracy_when_chunk_removed` = fraction of rollouts that are correct

Interpretation:

- Positive importance = the chunk helps (anchor candidate)
- Negative importance = the chunk actually hurts accuracy
- Zero importance = the chunk doesn't matter
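The definition reduces to a one-liner; here is a sketch with invented numbers (the function name is ours):

```python
def importance(base_correct, rollouts_correct):
    """importance = baseline_accuracy - accuracy_when_chunk_removed (sketch)."""
    baseline = 1.0 if base_correct else 0.0
    removed_accuracy = sum(rollouts_correct) / len(rollouts_correct)
    return baseline - removed_accuracy

# Base solution correct; only 40 of 100 rollouts succeed without the chunk:
# importance = 1.0 - 0.40 = 0.60, a strong anchor candidate
score = importance(True, [1] * 40 + [0] * 60)
```

The bootstrap CI above is computed over exactly these per-rollout effects, so a chunk is IMPORTANT when the whole interval around this mean sits above zero.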
| Strategy | Samples | Savings | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Paper (Brute Force) | 647,300 | 0% | 100% | 100% | 100% |
| Pure Adaptive | 332,620 | 48.6% | 100% | 99.9% | 99.9% |
| Pure Sparse | 276,400 | 57.3% | 97.9% | 94.0% | 95.9% |
| Hybrid | 137,680 | 78.7% | 98.0% | 94.1% | 96.0% |
| Evolved Hybrid | 56,750 | 91.2% | 97.7% | 87.1% | 92.1% |
We also tested on blackmail scenarios from the Thought Branches paper (Macar et al.), which analyzes LLM reasoning in ethical decision-making:
| Strategy | Samples | Savings | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Paper (Brute Force) | 93,500 | 0% | 100% | 100% | 100% |
| Pure Adaptive | 78,440 | 16.1% | 100% | 100% | 100% |
| Pure Sparse | 38,600 | 58.7% | 99.4% | 97.4% | 98.4% |
| Hybrid | 28,020 | 70.0% | 99.4% | 96.9% | 98.1% |
| Evolved Hybrid | 13,598 | 85.5% | 92.1% | 97.1% | 94.3% |
Metrics note: For blackmail, NEGLIGIBLE chunks are the "positive" class—these are chunks that cause blackmail behavior (removing them increases safety). Conversely, IMPORTANT means removing the chunk reduces safety (the chunk was preventing blackmail). So Recall = "of chunks causing blackmail, how many did we identify?" This differs from math where IMPORTANT is the positive class. Savings are computed relative to actual rollouts available (not assumed 100 per chunk).
Note: Blackmail scenarios show lower adaptive savings (16.1% vs 48.6%) because the reasoning patterns are more complex—chunk impacts are often ambiguous, requiring more samples to reach statistical confidence.
We also tested on whistleblower scenarios from the Thought Branches paper, analyzing LLM reasoning when deciding whether to autonomously report wrongdoing (e.g., reporting clinical trial data falsification to the FDA):
| Strategy | Samples | Savings | Agreement | Prec(IMP) | Rec(IMP) | F1(IMP) |
|---|---|---|---|---|---|---|
| Paper (Brute Force) | 103,634 | 0% | 100% | 100% | 100% | 100% |
| Pure Adaptive | 64,672 | 37.6% | 100% | 100% | 100% | 100% |
| Pure Sparse | 48,544 | 53.2% | 97.4% | 99.0% | 95.8% | 97.2% |
| Hybrid | 26,404 | 74.5% | 97.5% | 99.0% | 95.7% | 97.2% |
| Evolved Hybrid | 13,022 | 87.4% | 94.8% | 87.6% | 98.7% | 91.9% |
Metrics note: Whistleblower scenarios have no NEGLIGIBLE chunks—removing any chunk reduces the probability of whistleblowing behavior. Instead of precision/recall for NEGLIGIBLE, we measure IMPORTANT vs UNCERTAIN classification (where IMPORTANT = removing chunk significantly reduces whistleblowing, UNCERTAIN = effect is ambiguous). Agreement = % of chunks with same classification as paper baseline. Savings are computed relative to actual rollouts available (not assumed 100 per chunk).
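The agreement metric is a plain per-chunk match rate against the baseline classifications; a minimal sketch (function name is ours):

```python
def agreement(predicted, baseline):
    """Fraction of chunks classified the same as the paper baseline."""
    assert len(predicted) == len(baseline)
    return sum(p == b for p, b in zip(predicted, baseline)) / len(baseline)

# Three of four chunks match the baseline -> 0.75 agreement
agreement(["IMPORTANT", "UNCERTAIN", "IMPORTANT", "IMPORTANT"],
          ["IMPORTANT", "UNCERTAIN", "IMPORTANT", "UNCERTAIN"])
```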
Key observations:
- The dataset splits 48.5% IMPORTANT / 51.5% UNCERTAIN / 0% NEGLIGIBLE
- Evolved Hybrid achieves 98.7% recall on IMPORTANT chunks (rarely misses critical reasoning steps)
- The lower precision (87.6%) reflects that with fewer samples, more chunks are conservatively classified as IMPORTANT rather than UNCERTAIN
- Pure Adaptive is the safe choice: 48.6% savings on math with 100% recall/precision, and 16-38% savings with 100% agreement on blackmail and whistleblower (in our runs). Since it samples every chunk, it should theoretically match the paper baseline. The safest option when accuracy matters most.
- Hybrid achieves good savings: 78.7% on math, 70.0% on blackmail, 74.5% on whistleblower, with consistent savings across domains while maintaining high recall (95%+).
- Evolved Hybrid maximizes savings: 91.2% on math, 85.5% on blackmail, 87.4% on whistleblower. Best for cost-sensitive applications where some precision loss is acceptable.
- Domain affects savings: Math problems have clearer "important vs not" chunks, enabling more aggressive early stopping. Safety scenarios (blackmail, whistleblower) have more nuanced reasoning patterns.
- Different classification challenges: Math and blackmail have NEGLIGIBLE chunks to find; whistleblower has none (all chunks reduce whistleblowing when removed). The strategies adapt to both types of problems.
- Results are robust to rollout ordering: Running all analyses with randomized rollout order (seed=42) changed metrics by <1% in most cases, confirming no systematic ordering bias in the datasets. The `--randomize` flag is available in all analysis scripts.
| Use Case | Strategy | Math | Blackmail | Whistleblower | Why |
|---|---|---|---|---|---|
| Research (100% recall) | Pure Adaptive | 48.6% | 16.1% | 37.6% | Finds ALL anchors |
| Balanced | Hybrid | 78.7% | 70.0% | 74.5% | Good savings, high recall |
| Production (cost-sensitive) | Evolved Hybrid | 91.2% | 85.5% | 87.4% | Maximum savings |
| Quick exploration | Pure Sparse | 57.3% | 58.7% | 53.2% | Simple, no adaptive |
```
frugal-thought-anchors/
├── README.md                        # This file
├── pyproject.toml                   # Project dependencies
├── run_analysis.py                  # Main analysis script for math problems
├── run_blackmail_analysis.py        # Analysis script for blackmail scenarios
├── run_whistleblower_analysis.py    # Analysis script for whistleblower scenarios
│
├── sampling_strategies/             # Sampling strategy implementations
│   ├── __init__.py                  # Package exports
│   ├── common.py                    # Shared types and utilities
│   ├── adaptive.py                  # Pure adaptive sampling
│   ├── sparse.py                    # Pure sparse sampling
│   ├── hybrid.py                    # Hybrid (sparse + adaptive)
│   └── evolved_hybrid.py            # OpenEvolve-optimized algorithm
│
├── rollout_loader.py                # Load math rollouts from disk
├── rollout_loader_blackmail.py      # Load blackmail rollouts from disk
├── rollout_loader_whistleblower.py  # Load whistleblower rollouts from disk
├── download_problems.py             # Download math rollouts from HuggingFace
├── download_blackmail.py            # Download blackmail rollouts from HuggingFace
├── download_whistleblower.py        # Download whistleblower rollouts from HuggingFace
│
├── openevolve/                      # OpenEvolve algorithm discovery (see openevolve/README.md)
│   └── ...
│
├── hf_math_rollouts/                # Downloaded math rollouts (created by download_problems.py)
├── hf_blackmail_rollouts/           # Downloaded blackmail rollouts (created by download_blackmail.py)
└── hf_whistleblower_rollouts/       # Downloaded whistleblower rollouts (created by download_whistleblower.py)
```
```
uv run python run_analysis.py
uv run python run_analysis.py --save
```

```
# Download the 40-problem evaluation subset used in this README
uv run python download_problems.py --problems 330 1591 2050 2137 2189 2236 2238 2870 3360 3448 3550 3916 3935 4019 4164 4605 4682 6481 6596 6998

# Download specific problems
uv run python download_problems.py --problems 4682 330 1591

# List available problems
uv run python download_problems.py --list-only

# Download *everything* available in the HF dataset (can be very large)
uv run python download_problems.py --all
```

```python
from rollout_loader import load_problem, MODEL_PATH
from sampling_strategies.adaptive import simulate_problem, OPTIMAL
from sampling_strategies.sparse import sparse_sample_problem, DEFAULT as SPARSE_DEFAULT
from sampling_strategies.hybrid import hybrid_sample_problem, HybridConfig
import sampling_strategies.evolved_hybrid as evolved_hybrid

# Load a problem
problem = load_problem(MODEL_PATH / "correct_base_solution" / "problem_4682")
baseline = 1.0 if problem.base_correct else 0.0

# Run pure adaptive
adaptive_result = simulate_problem(problem, OPTIMAL)
print(f"Adaptive: {adaptive_result.savings_percent:.1f}% savings")

# Run pure sparse
sparse_result = sparse_sample_problem(problem, SPARSE_DEFAULT)
print(f"Sparse: {sparse_result.savings_percent:.1f}% savings")

# Run hybrid with custom config
config = HybridConfig(
    sparse_step=3,
    adaptive_schedule=[10, 20, 30, 50, 70, 100],
    jump_threshold=0.10,
)
hybrid_result = hybrid_sample_problem(problem, config)
print(f"Hybrid: {hybrid_result.savings_percent:.1f}% savings")

# Run evolved hybrid (OpenEvolve-optimized)
evolved_result = evolved_hybrid.sample_problem(problem.chunk_rollouts, baseline)
total_samples = sum(s for _, s, _ in evolved_result.values())
print(f"Evolved: {total_samples} samples")
```