TL;DR: Identify thought anchors with 91% fewer API calls while still finding 98% of them.
This project provides cost-efficient sampling strategies for detecting "thought anchors" - critical reasoning steps in LLM chain-of-thought (CoT) traces that determine the final answer.
The Thought Anchors paper identifies important reasoning steps by sampling each chunk 100 times. For 150 chunks, that's 15,000 API calls per problem.
We evaluated five strategies on 40 math problems (6,473 chunks with rollouts, 1,612 thought anchors):
| Strategy | Samples | Savings | Recall | Precision | F1 | Recommendation |
|---|---|---|---|---|---|---|
| Paper (Baseline) | 647,300 | — | 100% | 100% | 100% | Ground truth |
| Pure Adaptive | 332,620 | 48.6% | 100% | 99.9% | 99.9% | ✅ When recall matters |
| Pure Sparse | 276,400 | 57.3% | 97.9% | 94.0% | 95.9% | Good baseline |
| Hybrid | 137,680 | 78.7% | 98.0% | 94.1% | 96.0% | Good value |
| Evolved Hybrid | 56,750 | 91.2% | 97.7% | 87.1% | 92.1% | ✅ Best savings |
Key insight: Most chunks are either clearly important or clearly not. You don't need 100 samples to tell—adaptive early stopping alone saves 49%. Combine with sparse sampling (every 3rd chunk + fill-in) for 79% total savings.
The best algorithm (Evolved Hybrid) was discovered using OpenEvolve, an LLM-driven code evolution framework. It found novel optimizations—shorter sampling schedules, aggressive early-exit thresholds, and distance-aware verification—achieving 91% savings while maintaining 98% recall.
Also tested on safety scenarios (Thought Branches paper):
- Blackmail scenarios: 85.5% savings with 92.1% recall
- Whistleblower scenarios: 87.4% savings with 94.8% agreement
These results demonstrate the approach generalizes beyond math problems to AI safety analysis.
Robustness check: We verified results are not sensitive to rollout ordering by running all analyses with randomized rollout order—differences were <1% across all metrics.
Terminology note: This repo measures cost in samples (i.e., "how many rollouts you would have generated"), because it operates on already-generated rollouts. Interpreting samples as "API calls" matches the original paper's resampling procedure.
Thanks to Paul Bogdan for suggesting the sparse sampling approach, and to Paul and Uzay Macar for their work on thought anchors generally.
- Quick Start
- Background: What Are Thought Anchors?
- The Five Sampling Strategies
- Technical Details
- Results
- File Structure
- Usage Examples
```
uv sync
```

The analysis uses pre-generated rollouts from HuggingFace (no new LLM calls are made). The rollout data is large (often multiple GB; the 40-problem subset used here is ~20GB on disk).
If you already have hf_math_rollouts/ locally, you can skip this step. Otherwise, download the same 40-problem evaluation subset with:
```
uv run python download_problems.py --problems 330 1591 2050 2137 2189 2236 2238 2870 3360 3448 3550 3916 3935 4019 4164 4605 4682 6481 6596 6998
```

Note: This requires internet access and downloads from huggingface.co/datasets/uzaymacar/math-rollouts.
This downloads rollouts for 20 problem IDs and saves both:
- `correct_base_solution/` (20 problems)
- `incorrect_base_solution/` (20 problems)
So the standard evaluation set is 40 problems total. Each problem has ~150 chunks; most chunks have 100 pre-generated rollouts in chunk_*/solutions.json.
Additional options:
- `uv run python download_problems.py --list-only` lists what the HF dataset currently contains.
- `uv run python download_problems.py --all` downloads all available problems in the HF dataset and can be very large.
```
# Math problems analysis
uv run python run_analysis.py

# Blackmail scenarios analysis (requires separate download)
uv run python download_blackmail.py
uv run python run_blackmail_analysis.py

# Whistleblower scenarios analysis (requires separate download)
uv run python download_whistleblower.py
uv run python run_whistleblower_analysis.py
```

When an LLM solves a math problem using chain-of-thought reasoning, it generates a long reasoning trace with many steps. Thought anchors are the critical steps that actually matter for the final answer.
To identify thought anchors, the paper uses this approach:
- Split the reasoning trace into chunks (sentences or paragraphs)
- For each chunk:
- Remove it from the trace
- Generate 100 new continuations from that point
- Measure how accuracy changes
- Classify based on accuracy change:
- IMPORTANT: Removing this chunk significantly hurts accuracy
- NEGLIGIBLE: Removing this chunk doesn't affect accuracy
- UNCERTAIN: Can't tell with statistical confidence
For a single problem with 150 chunks, the brute-force method requires:
150 chunks × 100 samples = 15,000 API calls
This project explores how to achieve the same results with fewer API calls.
The baseline approach from the original paper.
| Aspect | Value |
|---|---|
| Chunks sampled | All (100%) |
| Samples per chunk | 100 (fixed) |
| Early stopping | No |
| Interpolation | No |
How it works:
- Sample every chunk exactly 100 times
- Classify based on bootstrap confidence interval
Sample every chunk, but stop early when the classification is statistically confident.
| Aspect | Value |
|---|---|
| Chunks sampled | All (100%) |
| Samples per chunk | 10-100 (adaptive) |
| Early stopping | Yes |
| Interpolation | No |
How it works:
- Start with 10 samples for a chunk
- Compute bootstrap confidence interval (CI) for importance
- If CI decisively classifies the chunk → stop early
- Otherwise, add more samples: 20, 30, 50, 70, then 100
Adaptive Schedule: [10, 20, 30, 50, 70, 100]
At each checkpoint, we stop if:
- IMPORTANT: CI lower bound > 0 (removing chunk hurts accuracy)
- NEGLIGIBLE: CI upper bound < 0 (removing chunk helps or doesn't matter)
Most chunks are clearly important or clearly negligible, so early stopping saves ~49% of samples while maintaining 100% recall.
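The checkpoint logic above can be sketched as a small, self-contained loop. This is an illustrative sketch, not the repo's actual implementation (which lives in `sampling_strategies/adaptive.py`); the helper names are ours, and `bootstrap_ci` mirrors the function shown under Technical Details.

```python
import numpy as np

def bootstrap_ci(effects, confidence=0.95, n_reps=1000, seed=42):
    """Bootstrap CI over the mean effect (effect = baseline - is_correct)."""
    arr = np.array(effects)
    rng = np.random.default_rng(seed)
    means = [np.mean(rng.choice(arr, size=len(arr), replace=True))
             for _ in range(n_reps)]
    alpha = 1 - confidence
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def adaptive_classify(effects, schedule=(10, 20, 30, 50, 70, 100)):
    """Stop at the first checkpoint where the CI is decisive.

    `effects` is the full per-rollout effect list for one chunk;
    returns (classification, samples_used).
    """
    for n in schedule:
        lo, hi = bootstrap_ci(effects[:n])
        if lo > 0:                       # removing the chunk hurts accuracy
            return "IMPORTANT", n
        if hi < 0:                       # removing the chunk helps / no effect
            return "NEGLIGIBLE", n
    return "UNCERTAIN", schedule[-1]     # never reached a decisive CI
```

A chunk whose removal consistently flips the answer exits at the first checkpoint with only 10 samples, which is where the ~49% savings come from.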
Sample only every Nth chunk, fill in gaps when accuracy changes significantly.
| Aspect | Value |
|---|---|
| Chunks sampled | ~33% initially (every 3rd) |
| Samples per chunk | 100 (fixed, no adaptive) |
| Early stopping | No |
| Interpolation | Yes (for remaining gaps) |
How it works:
Pass 1 - Sparse Sampling:
```
Chunks:  0 1 2 3 4 5 6 7 8 9 10 11 ...
Sampled: ✓ - - ✓ - - ✓ - - ✓ -  -  ...
```
Sample chunks 0, 3, 6, 9, ... (every 3rd) with full 100 samples.
Pass 2 - Fill-in: If accuracy jumps significantly between sparse samples, fill in the gap:
```
Chunk 3 accuracy: 0.85
Chunk 6 accuracy: 0.45  ← Jump of 0.40 detected!
Fill in chunks 4 and 5 with 100 samples each
```
Pass 3 - Interpolation: Remaining chunks inherit classification from neighbors (see Interpolation Modes).
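Passes 1 and 2 reduce to choosing which chunk indices to sample in full. A minimal sketch of that selection (the helper name and signature are ours, not the repo's API):

```python
def sparse_plan(n_chunks, step=3, accuracies=None, jump_threshold=0.10):
    """Return the sorted chunk indices to sample with full rollouts.

    Pass 1 picks every `step`-th chunk; pass 2 fills the gap between two
    sparse chunks whose accuracies differ by >= `jump_threshold`.
    `accuracies` maps sampled index -> observed accuracy.
    """
    sampled = set(range(0, n_chunks, step))          # Pass 1: sparse grid
    if accuracies:
        for i in range(0, n_chunks - step, step):    # Pass 2: jump detection
            j = i + step
            if i in accuracies and j in accuracies and \
               abs(accuracies[j] - accuracies[i]) >= jump_threshold:
                sampled.update(range(i + 1, j))      # fill in the gap
    return sorted(sampled)
```

With the accuracies from the example above (0.85 at chunk 3, 0.45 at chunk 6), the 0.40 jump triggers fill-in of chunks 4 and 5; everything else is left to interpolation in pass 3.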
Combines the best of both: skip chunks AND stop early.
| Aspect | Value |
|---|---|
| Chunks sampled | ~33% initially (every 3rd) |
| Samples per chunk | 10-100 (adaptive) |
| Early stopping | Yes |
| Interpolation | Yes (for remaining gaps) |
How it works:
Same three-pass structure as Pure Sparse, but uses adaptive sampling instead of fixed 100 samples:
Pass 1 - Sparse with Adaptive: Sample every 3rd chunk using adaptive early stopping (10→100 samples)
Pass 2 - Fill-in with Adaptive: When accuracy jumps detected, fill in gaps using adaptive sampling
Pass 3 - Interpolation: Remaining chunks inherit from neighbors
An LLM-evolved algorithm that discovered novel optimizations beyond hand-tuned strategies.
| Aspect | Value |
|---|---|
| Chunks sampled | ~20% initially (every 5th) |
| Samples per chunk | 6-80 (aggressive adaptive) |
| Early stopping | Yes (more aggressive) |
| Interpolation | Yes (with smart verification) |
Key optimizations discovered by OpenEvolve:
- Shorter schedule `[6, 12, 24, 48, 80]` - start earlier, exit faster
- Aggressive early exit at `mean_effect > 0.12`
- Lower confidence (0.90 vs 0.95) - tolerates more uncertainty
- Fewer bootstrap reps (400-600 vs 1000) - faster decisions
- Distance-aware verification - only verify interpolated IMPORTANT chunks when neighbors are ambiguous
See openevolve/README.md for how this algorithm was discovered.
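The discovered settings can be summarized as a config sketch. The field names here are illustrative only, not the repo's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvolvedHybridParams:
    """Illustrative summary of the OpenEvolve-discovered settings."""
    sparse_step: int = 5                      # sample every 5th chunk (vs every 3rd)
    schedule: tuple = (6, 12, 24, 48, 80)     # shorter adaptive schedule
    early_exit_effect: float = 0.12           # exit immediately on strong mean effect
    confidence: float = 0.90                  # vs 0.95 in hand-tuned strategies
    bootstrap_reps: int = 500                 # 400-600 vs the usual 1000
    distance_aware_verification: bool = True  # verify interpolated IMPORTANT chunks
                                              # only when neighbors are ambiguous
```

Each field trades a little statistical caution for fewer samples, which is how the strategy reaches ~91% savings while keeping ~98% recall.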
Classification uses bootstrap resampling to compute confidence intervals:
```python
import numpy as np

def bootstrap_ci(values, confidence=0.95, n_reps=1000):
    # values: list[float] of effects, where effect = baseline - is_correct
    arr = np.array(values)
    rng = np.random.default_rng(42)
    means = [np.mean(rng.choice(arr, size=len(arr), replace=True))
             for _ in range(n_reps)]
    alpha = 1 - confidence
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```

Classification logic:

- If `ci_lower > 0`: IMPORTANT (removing chunk hurts accuracy)
- If `ci_upper < 0`: NEGLIGIBLE (removing chunk doesn't hurt)
- Otherwise: UNCERTAIN
Fill-in occurs when accuracy changes significantly between sparse samples:
```python
jump_threshold = 0.10  # 10% accuracy change

if abs(accuracy[chunk_i + 3] - accuracy[chunk_i]) >= jump_threshold:
    # Fill in chunks i+1 and i+2
    ...
```

Threshold choices:

- `0.001` (essentially 0) = fill ALL gaps → maximum accuracy, moderate savings
- `0.05` = fill on 5% jumps → good balance
- `0.10` = fill on 10% jumps → more aggressive savings
When a chunk isn't sampled, it must be classified by interpolation:
Conservative (recommended):

```python
# If either neighbor is IMPORTANT, classify as IMPORTANT
if prev.classification == "IMPORTANT" or next.classification == "IMPORTANT":
    return "IMPORTANT"
# If neighbors agree, use that classification
elif prev.classification == next.classification:
    return prev.classification
else:
    return "UNCERTAIN"
```

Nearest:

```python
# Use classification from the closer neighbor
if distance_to_prev <= distance_to_next:
    return prev.classification
else:
    return next.classification
```

From the paper, importance measures how much removing a chunk affects accuracy:
```
importance(chunk) = P(correct | with chunk) - P(correct | without chunk)
                  = baseline_accuracy - accuracy_when_chunk_removed
```
Where:

- `baseline_accuracy` = 1.0 if the base solution is correct, 0.0 if incorrect
- `accuracy_when_chunk_removed` = fraction of rollouts that are correct

Interpretation:

- Positive importance = the chunk helps (anchor candidate)
- Negative importance = the chunk actually hurts accuracy
- Zero importance = the chunk doesn't matter
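The definition reduces to a one-liner; here is a sketch with invented numbers (the function name is ours):

```python
def importance(base_correct, rollouts_correct):
    """importance = baseline_accuracy - accuracy_when_chunk_removed (sketch)."""
    baseline = 1.0 if base_correct else 0.0
    removed_accuracy = sum(rollouts_correct) / len(rollouts_correct)
    return baseline - removed_accuracy

# Base solution correct; only 40 of 100 rollouts succeed without the chunk:
# importance = 1.0 - 0.40 = 0.60, a strong anchor candidate
score = importance(True, [1] * 40 + [0] * 60)
```

The bootstrap CI above is computed over exactly these per-rollout effects, so a chunk is IMPORTANT when the whole interval around this mean sits above zero.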
| Strategy | Samples | Savings | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Paper (Brute Force) | 647,300 | 0% | 100% | 100% | 100% |
| Pure Adaptive | 332,620 | 48.6% | 100% | 99.9% | 99.9% |
| Pure Sparse | 276,400 | 57.3% | 97.9% | 94.0% | 95.9% |
| Hybrid | 137,680 | 78.7% | 98.0% | 94.1% | 96.0% |
| Evolved Hybrid | 56,750 | 91.2% | 97.7% | 87.1% | 92.1% |
We also tested on blackmail scenarios from the Thought Branches paper (Macar et al.), which analyzes LLM reasoning in ethical decision-making:
| Strategy | Samples | Savings | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Paper (Brute Force) | 93,500 | 0% | 100% | 100% | 100% |
| Pure Adaptive | 78,440 | 16.1% | 100% | 100% | 100% |
| Pure Sparse | 38,600 | 58.7% | 99.4% | 97.4% | 98.4% |
| Hybrid | 28,020 | 70.0% | 99.4% | 96.9% | 98.1% |
| Evolved Hybrid | 13,598 | 85.5% | 92.1% | 97.1% | 94.3% |
Metrics note: For blackmail, NEGLIGIBLE chunks are the "positive" class—these are chunks that cause blackmail behavior (removing them increases safety). Conversely, IMPORTANT means removing the chunk reduces safety (the chunk was preventing blackmail). So Recall = "of chunks causing blackmail, how many did we identify?" This differs from math where IMPORTANT is the positive class. Savings are computed relative to actual rollouts available (not assumed 100 per chunk).
Note: Blackmail scenarios show lower adaptive savings (16.1% vs 48.6%) because the reasoning patterns are more complex—chunk impacts are often ambiguous, requiring more samples to reach statistical confidence.
We also tested on whistleblower scenarios from the Thought Branches paper, analyzing LLM reasoning when deciding whether to autonomously report wrongdoing (e.g., reporting clinical trial data falsification to the FDA):
| Strategy | Samples | Savings | Agreement | Prec(IMP) | Rec(IMP) | F1(IMP) |
|---|---|---|---|---|---|---|
| Paper (Brute Force) | 103,634 | 0% | 100% | 100% | 100% | 100% |
| Pure Adaptive | 64,672 | 37.6% | 100% | 100% | 100% | 100% |
| Pure Sparse | 48,544 | 53.2% | 97.4% | 99.0% | 95.8% | 97.2% |
| Hybrid | 26,404 | 74.5% | 97.5% | 99.0% | 95.7% | 97.2% |
| Evolved Hybrid | 13,022 | 87.4% | 94.8% | 87.6% | 98.7% | 91.9% |
Metrics note: Whistleblower scenarios have no NEGLIGIBLE chunks—removing any chunk reduces the probability of whistleblowing behavior. Instead of precision/recall for NEGLIGIBLE, we measure IMPORTANT vs UNCERTAIN classification (where IMPORTANT = removing chunk significantly reduces whistleblowing, UNCERTAIN = effect is ambiguous). Agreement = % of chunks with same classification as paper baseline. Savings are computed relative to actual rollouts available (not assumed 100 per chunk).
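The agreement metric is a plain per-chunk match rate against the baseline classifications; a minimal sketch (function name is ours):

```python
def agreement(predicted, baseline):
    """Fraction of chunks classified the same as the paper baseline."""
    assert len(predicted) == len(baseline)
    return sum(p == b for p, b in zip(predicted, baseline)) / len(baseline)

# Three of four chunks match the baseline -> 0.75 agreement
agreement(["IMPORTANT", "UNCERTAIN", "IMPORTANT", "IMPORTANT"],
          ["IMPORTANT", "UNCERTAIN", "IMPORTANT", "UNCERTAIN"])
```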
Key observations:
- The dataset splits 48.5% IMPORTANT / 51.5% UNCERTAIN / 0% NEGLIGIBLE
- Evolved Hybrid achieves 98.7% recall on IMPORTANT chunks (rarely misses critical reasoning steps)
- The lower precision (87.6%) reflects that with fewer samples, more chunks are conservatively classified as IMPORTANT rather than UNCERTAIN
- Pure Adaptive is the safe choice: 48.6% savings on math with 100% recall/precision, and 16-38% savings with 100% agreement on blackmail and whistleblower (in our runs). Since it samples every chunk, it should theoretically match the paper baseline. The safest option when accuracy matters most.
- Hybrid achieves good savings: 78.7% on math, 70.0% on blackmail, 74.5% on whistleblower, with consistent savings across domains while maintaining high recall (95%+).
- Evolved Hybrid maximizes savings: 91.2% on math, 85.5% on blackmail, 87.4% on whistleblower. Best for cost-sensitive applications where some precision loss is acceptable.
- Domain affects savings: Math problems have clearer "important vs not" chunks, enabling more aggressive early stopping. Safety scenarios (blackmail, whistleblower) have more nuanced reasoning patterns.
- Different classification challenges: Math and blackmail have NEGLIGIBLE chunks to find; whistleblower has none (all chunks reduce whistleblowing when removed). The strategies adapt to both types of problems.
- Results are robust to rollout ordering: Running all analyses with randomized rollout order (seed=42) changed metrics by <1% in most cases, confirming no systematic ordering bias in the datasets. The `--randomize` flag is available in all analysis scripts.
| Use Case | Strategy | Math | Blackmail | Whistleblower | Why |
|---|---|---|---|---|---|
| Research (100% recall) | Pure Adaptive | 48.6% | 16.1% | 37.6% | Finds ALL anchors |
| Balanced | Hybrid | 78.7% | 70.0% | 74.5% | Good savings, high recall |
| Production (cost-sensitive) | Evolved Hybrid | 91.2% | 85.5% | 87.4% | Maximum savings |
| Quick exploration | Pure Sparse | 57.3% | 58.7% | 53.2% | Simple, no adaptive |
```
frugal-thought-anchors/
├── README.md                        # This file
├── pyproject.toml                   # Project dependencies
├── run_analysis.py                  # Main analysis script for math problems
├── run_blackmail_analysis.py        # Analysis script for blackmail scenarios
├── run_whistleblower_analysis.py    # Analysis script for whistleblower scenarios
│
├── sampling_strategies/             # Sampling strategy implementations
│   ├── __init__.py                  # Package exports
│   ├── common.py                    # Shared types and utilities
│   ├── adaptive.py                  # Pure adaptive sampling
│   ├── sparse.py                    # Pure sparse sampling
│   ├── hybrid.py                    # Hybrid (sparse + adaptive)
│   └── evolved_hybrid.py            # OpenEvolve-optimized algorithm
│
├── rollout_loader.py                # Load math rollouts from disk
├── rollout_loader_blackmail.py      # Load blackmail rollouts from disk
├── rollout_loader_whistleblower.py  # Load whistleblower rollouts from disk
├── download_problems.py             # Download math rollouts from HuggingFace
├── download_blackmail.py            # Download blackmail rollouts from HuggingFace
├── download_whistleblower.py        # Download whistleblower rollouts from HuggingFace
│
├── openevolve/                      # OpenEvolve algorithm discovery (see openevolve/README.md)
│   └── ...
│
├── hf_math_rollouts/                # Downloaded math rollouts (created by download_problems.py)
├── hf_blackmail_rollouts/           # Downloaded blackmail rollouts (created by download_blackmail.py)
└── hf_whistleblower_rollouts/       # Downloaded whistleblower rollouts (created by download_whistleblower.py)
```
```
uv run python run_analysis.py
uv run python run_analysis.py --save
```

```
# Download the 40-problem evaluation subset used in this README
uv run python download_problems.py --problems 330 1591 2050 2137 2189 2236 2238 2870 3360 3448 3550 3916 3935 4019 4164 4605 4682 6481 6596 6998

# Download specific problems
uv run python download_problems.py --problems 4682 330 1591

# List available problems
uv run python download_problems.py --list-only

# Download *everything* available in the HF dataset (can be very large)
uv run python download_problems.py --all
```

```python
from rollout_loader import load_problem, MODEL_PATH
from sampling_strategies.adaptive import simulate_problem, OPTIMAL
from sampling_strategies.sparse import sparse_sample_problem, DEFAULT as SPARSE_DEFAULT
from sampling_strategies.hybrid import hybrid_sample_problem, HybridConfig
import sampling_strategies.evolved_hybrid as evolved_hybrid

# Load a problem
problem = load_problem(MODEL_PATH / "correct_base_solution" / "problem_4682")
baseline = 1.0 if problem.base_correct else 0.0

# Run pure adaptive
adaptive_result = simulate_problem(problem, OPTIMAL)
print(f"Adaptive: {adaptive_result.savings_percent:.1f}% savings")

# Run pure sparse
sparse_result = sparse_sample_problem(problem, SPARSE_DEFAULT)
print(f"Sparse: {sparse_result.savings_percent:.1f}% savings")

# Run hybrid with custom config
config = HybridConfig(
    sparse_step=3,
    adaptive_schedule=[10, 20, 30, 50, 70, 100],
    jump_threshold=0.10,
)
hybrid_result = hybrid_sample_problem(problem, config)
print(f"Hybrid: {hybrid_result.savings_percent:.1f}% savings")

# Run evolved hybrid (OpenEvolve-optimized)
evolved_result = evolved_hybrid.sample_problem(problem.chunk_rollouts, baseline)
total_samples = sum(s for _, s, _ in evolved_result.values())
print(f"Evolved: {total_samples} samples")
```