Frugal Thought Anchors

TL;DR: Identify thought anchors with 91% fewer API calls while still finding 98% of them.

This project provides cost-efficient sampling strategies for detecting "thought anchors" - critical reasoning steps in LLM chain-of-thought (CoT) traces that determine the final answer.

Summary

The Thought Anchors paper identifies important reasoning steps by sampling each chunk 100 times. For 150 chunks, that's 15,000 API calls per problem.

We evaluated five strategies on 40 math problems (6,473 chunks with rollouts, 1,612 thought anchors):

| Strategy | Samples | Savings | Recall | Precision | F1 | Recommendation |
|---|---|---|---|---|---|---|
| Paper (Baseline) | 647,300 | 0% | 100% | 100% | 100% | Ground truth |
| Pure Adaptive | 332,620 | 48.6% | 100% | 99.9% | 99.9% | ✅ When recall matters |
| Pure Sparse | 276,400 | 57.3% | 97.9% | 94.0% | 95.9% | Good baseline |
| Hybrid | 137,680 | 78.7% | 98.0% | 94.1% | 96.0% | Good value |
| Evolved Hybrid | 56,750 | 91.2% | 97.7% | 87.1% | 92.1% | Best savings |

Key insight: Most chunks are either clearly important or clearly not. You don't need 100 samples to tell—adaptive early stopping alone saves 49%. Combine with sparse sampling (every 3rd chunk + fill-in) for 79% total savings.

The best algorithm (Evolved Hybrid) was discovered using OpenEvolve, an LLM-driven code evolution framework. It found novel optimizations—shorter sampling schedules, aggressive early-exit thresholds, and distance-aware verification—achieving 91% savings while maintaining 98% recall.

Also tested on safety scenarios (Thought Branches paper):

  • Blackmail scenarios: 85.5% savings with 92.1% recall
  • Whistleblower scenarios: 87.4% savings with 94.8% agreement

These results demonstrate the approach generalizes beyond math problems to AI safety analysis.

Robustness check: We verified results are not sensitive to rollout ordering by running all analyses with randomized rollout order—differences were <1% across all metrics.

Terminology note: This repo measures cost in samples (i.e., "how many rollouts you would have generated"), because it operates on already-generated rollouts. Interpreting samples as "API calls" matches the original paper's resampling procedure.

Thanks to Paul Bogdan for suggesting the sparse sampling approach, and to Paul and Uzay Macar for their work on thought anchors generally.


Table of Contents

  1. Quick Start
  2. Background: What Are Thought Anchors?
  3. The Five Sampling Strategies
  4. Technical Details
  5. Results
  6. File Structure
  7. Usage Examples

Quick Start

1. Install dependencies

uv sync

2. Download rollout data

The analysis uses pre-generated rollouts from HuggingFace (no new LLM calls are made). The rollout data is large (often multiple GB; the 40-problem subset used here is ~20GB on disk).

If you already have hf_math_rollouts/ locally, you can skip this step. Otherwise, download the same 40-problem evaluation subset with:

uv run python download_problems.py --problems 330 1591 2050 2137 2189 2236 2238 2870 3360 3448 3550 3916 3935 4019 4164 4605 4682 6481 6596 6998

Note: This requires internet access and downloads from huggingface.co/datasets/uzaymacar/math-rollouts.

This downloads rollouts for 20 problem IDs and saves both:

  • correct_base_solution/ (20 problems)
  • incorrect_base_solution/ (20 problems)

So the standard evaluation set is 40 problems total. Each problem has ~150 chunks; most chunks have 100 pre-generated rollouts in chunk_*/solutions.json.

Additional options:

  • uv run python download_problems.py --list-only lists what the HF dataset currently contains.
  • uv run python download_problems.py --all downloads all available problems in the HF dataset and can be very large.

3. Run the analysis

# Math problems analysis
uv run python run_analysis.py

# Blackmail scenarios analysis (requires separate download)
uv run python download_blackmail.py
uv run python run_blackmail_analysis.py

# Whistleblower scenarios analysis (requires separate download)
uv run python download_whistleblower.py
uv run python run_whistleblower_analysis.py

Background: What Are Thought Anchors?

When an LLM solves a math problem using chain-of-thought reasoning, it generates a long reasoning trace with many steps. Thought anchors are the critical steps that actually matter for the final answer.

The Resampling Method

To identify thought anchors, the paper uses this approach:

  1. Split the reasoning trace into chunks (sentences or paragraphs)
  2. For each chunk:
    • Remove it from the trace
    • Generate 100 new continuations from that point
    • Measure how accuracy changes
  3. Classify based on accuracy change:
    • IMPORTANT: Removing this chunk significantly hurts accuracy
    • NEGLIGIBLE: Removing this chunk doesn't affect accuracy
    • UNCERTAIN: Can't tell with statistical confidence

The Problem: Cost

For a single problem with 150 chunks, the brute-force method requires:

150 chunks × 100 samples = 15,000 API calls

This project explores how to achieve the same results with fewer API calls.


The Five Sampling Strategies

Strategy 1: Paper (Brute Force)

The baseline approach from the original paper.

| Aspect | Value |
|---|---|
| Chunks sampled | All (100%) |
| Samples per chunk | 100 (fixed) |
| Early stopping | No |
| Interpolation | No |

How it works:

  1. Sample every chunk exactly 100 times
  2. Classify based on bootstrap confidence interval

Strategy 2: Pure Adaptive

Sample every chunk, but stop early when the classification is statistically confident.

| Aspect | Value |
|---|---|
| Chunks sampled | All (100%) |
| Samples per chunk | 10-100 (adaptive) |
| Early stopping | Yes |
| Interpolation | No |

How it works:

  1. Start with 10 samples for a chunk
  2. Compute bootstrap confidence interval (CI) for importance
  3. If CI decisively classifies the chunk → stop early
  4. Otherwise, add more samples: 20, 30, 50, 70, then 100

Adaptive Schedule: [10, 20, 30, 50, 70, 100]

At each checkpoint, we stop if:

  • IMPORTANT: CI lower bound > 0 (removing chunk hurts accuracy)
  • NEGLIGIBLE: CI upper bound < 0 (removing chunk helps or doesn't matter)

Most chunks are clearly important or clearly negligible, so early stopping saves ~49% of samples while maintaining 100% recall.
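The schedule and stopping rules above can be sketched end to end. This is a minimal sketch over pre-generated rollouts; `adaptive_classify`, `bootstrap_ci`, and the `effects` argument are illustrative names, not the repo's actual API:

```python
import numpy as np

def bootstrap_ci(values, confidence=0.95, n_reps=1000):
    """Percentile bootstrap CI for the mean effect."""
    arr = np.array(values)
    rng = np.random.default_rng(42)
    means = [np.mean(rng.choice(arr, size=len(arr), replace=True))
             for _ in range(n_reps)]
    alpha = 1 - confidence
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def adaptive_classify(effects, schedule=(10, 20, 30, 50, 70, 100)):
    """Walk the sample schedule, stopping as soon as the CI is decisive.

    effects: per-rollout effect values (baseline - is_correct).
    Returns (classification, samples_used).
    """
    for n in schedule:
        lo, hi = bootstrap_ci(effects[:n])
        if lo > 0:
            return "IMPORTANT", n    # removing the chunk hurts accuracy
        if hi < 0:
            return "NEGLIGIBLE", n   # removing it helps or does nothing
    return "UNCERTAIN", schedule[-1]
```

A chunk whose first 10 effects are decisively one-sided stops immediately at 10 samples, which is where the bulk of the ~49% savings comes from.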


Strategy 3: Pure Sparse (with Fill-in)

Sample only every Nth chunk, fill in gaps when accuracy changes significantly.

| Aspect | Value |
|---|---|
| Chunks sampled | ~33% initially (every 3rd) |
| Samples per chunk | 100 (fixed, no adaptive) |
| Early stopping | No |
| Interpolation | Yes (for remaining gaps) |

How it works:

Pass 1 - Sparse Sampling:

Chunks:   0   1   2   3   4   5   6   7   8   9   10  11  ...
Sampled:  ✓   -   -   ✓   -   -   ✓   -   -   ✓   -   -   ...

Sample chunks 0, 3, 6, 9, ... (every 3rd) with full 100 samples.

Pass 2 - Fill-in: If accuracy jumps significantly between sparse samples, fill in the gap:

Chunk 3 accuracy: 0.85
Chunk 6 accuracy: 0.45   ← Jump of 0.40 detected!

Fill in chunks 4 and 5 with 100 samples each

Pass 3 - Interpolation: Remaining chunks inherit classification from neighbors (see Interpolation Modes).
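The three passes can be sketched as a planning function. This is illustrative only; `plan_sparse_passes` and the `accuracy` callback are assumed names standing in for the repo's actual loaders:

```python
def plan_sparse_passes(n_chunks, accuracy, step=3, jump_threshold=0.10):
    """Plan which chunks get full sampling under the sparse strategy.

    accuracy: callable chunk_index -> rollout accuracy, queried only for
    chunks we decide to sample. Returns (sampled, interpolated) index lists.
    """
    # Pass 1: sample every `step`-th chunk
    sampled = list(range(0, n_chunks, step))
    acc = {i: accuracy(i) for i in sampled}

    # Pass 2: fill in gaps where accuracy jumps between sparse neighbors
    filled = []
    for a, b in zip(sampled, sampled[1:]):
        if abs(acc[b] - acc[a]) >= jump_threshold:
            filled.extend(range(a + 1, b))
    for i in filled:
        acc[i] = accuracy(i)

    # Pass 3: everything else is left to neighbor interpolation
    covered = set(sampled) | set(filled)
    interpolated = [i for i in range(n_chunks) if i not in covered]
    return sorted(covered), interpolated
```

On the example above (accuracy 0.85 through chunk 3, 0.45 from chunk 6), the 0.40 jump between chunks 3 and 6 triggers fill-in of chunks 4 and 5 while the flat regions stay interpolated.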


Strategy 4: Hybrid (Sparse + Adaptive)

Combines the best of both: skip chunks AND stop early.

| Aspect | Value |
|---|---|
| Chunks sampled | ~33% initially (every 3rd) |
| Samples per chunk | 10-100 (adaptive) |
| Early stopping | Yes |
| Interpolation | Yes (for remaining gaps) |

How it works:

Same three-pass structure as Pure Sparse, but uses adaptive sampling instead of fixed 100 samples:

Pass 1 - Sparse with Adaptive: Sample every 3rd chunk using adaptive early stopping (10→100 samples)

Pass 2 - Fill-in with Adaptive: When accuracy jumps detected, fill in gaps using adaptive sampling

Pass 3 - Interpolation: Remaining chunks inherit from neighbors


Strategy 5: Evolved Hybrid (OpenEvolve-Optimized)

An LLM-evolved algorithm that discovered novel optimizations beyond hand-tuned strategies.

| Aspect | Value |
|---|---|
| Chunks sampled | ~20% initially (every 5th) |
| Samples per chunk | 6-80 (aggressive adaptive) |
| Early stopping | Yes (more aggressive) |
| Interpolation | Yes (with smart verification) |

Key optimizations discovered by OpenEvolve:

  1. Shorter schedule [6, 12, 24, 48, 80] - start earlier, exit faster
  2. Aggressive early exit at mean_effect > 0.12
  3. Lower confidence (0.90 vs 0.95) - tolerates more uncertainty
  4. Fewer bootstrap reps (400-600 vs 1000) - faster decisions
  5. Distance-aware verification - only verify interpolated IMPORTANT chunks when neighbors are ambiguous

See openevolve/README.md for how this algorithm was discovered.
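The first two optimizations can be sketched as follows. This is a simplified illustration of the shorter schedule and mean-effect exit, not the actual evolved_hybrid.py code:

```python
import numpy as np

def evolved_early_exit(effects, schedule=(6, 12, 24, 48, 80),
                       mean_exit=0.12, confidence=0.90, n_reps=500):
    """Sketch: exit on a large mean effect before falling back to a
    bootstrap CI at the lower 0.90 confidence level."""
    rng = np.random.default_rng(0)
    for n in schedule:
        sample = np.array(effects[:n])
        if sample.mean() > mean_exit:   # aggressive early exit
            return "IMPORTANT", n
        means = [rng.choice(sample, size=len(sample), replace=True).mean()
                 for _ in range(n_reps)]
        alpha = 1 - confidence
        lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        if lo > 0:
            return "IMPORTANT", n
        if hi < 0:
            return "NEGLIGIBLE", n
    return "UNCERTAIN", schedule[-1]
```

Starting at 6 samples instead of 10 and capping at 80 instead of 100 is where most of the extra savings over the hand-tuned Hybrid comes from.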


Technical Details

Bootstrap Confidence Interval

Classification uses bootstrap resampling to compute confidence intervals:

import numpy as np

def bootstrap_ci(values, confidence=0.95, n_reps=1000):
    # values: list[float] of effects, where effect = baseline - is_correct
    arr = np.array(values)
    rng = np.random.default_rng(42)
    means = [np.mean(rng.choice(arr, size=len(arr), replace=True)) for _ in range(n_reps)]
    alpha = 1 - confidence
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

Classification logic:

  • If ci_lower > 0: IMPORTANT (removing chunk hurts accuracy)
  • If ci_upper < 0: NEGLIGIBLE (removing chunk doesn't hurt)
  • Otherwise: UNCERTAIN

Fill-in Trigger

Fill-in occurs when accuracy changes significantly between sparse samples:

jump_threshold = 0.10  # 10% accuracy change

if abs(accuracy[chunk_i + 3] - accuracy[chunk_i]) >= jump_threshold:
    ...  # fill in chunks i+1 and i+2 with full sampling

Threshold choices:

  • 0.001 (essentially 0) = Fill ALL gaps → maximum accuracy, moderate savings
  • 0.05 = Fill on 5% jumps → good balance
  • 0.10 = Fill on 10% jumps → more aggressive savings

Interpolation Modes

When a chunk isn't sampled, it must be classified by interpolation:

Conservative (recommended):

# If either neighbor is IMPORTANT, classify as IMPORTANT
if prev.classification == "IMPORTANT" or next.classification == "IMPORTANT":
    return "IMPORTANT"
# If neighbors agree, use that classification
elif prev.classification == next.classification:
    return prev.classification
else:
    return "UNCERTAIN"

Nearest:

# Use classification from the closer neighbor
if distance_to_prev <= distance_to_next:
    return prev.classification
else:
    return next.classification
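Both modes can be combined into one runnable helper. This is an illustrative sketch; `interpolate` is not the repo's actual function:

```python
def interpolate(prev_label, next_label, dist_prev=1, dist_next=1,
                mode="conservative"):
    """Classify an unsampled chunk from its nearest sampled neighbors.

    prev_label/next_label: classifications of the sampled chunks on either
    side; dist_prev/dist_next: distance in chunks to each neighbor.
    """
    if mode == "conservative":
        if "IMPORTANT" in (prev_label, next_label):
            return "IMPORTANT"   # err toward flagging potential anchors
        if prev_label == next_label:
            return prev_label    # neighbors agree
        return "UNCERTAIN"
    # "nearest": inherit from whichever sampled neighbor is closer
    return prev_label if dist_prev <= dist_next else next_label
```

Conservative mode trades precision for recall, which is why it is the recommended default when missed anchors are costlier than false positives.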

Importance Formula

From the paper, importance measures how much removing a chunk affects accuracy:

importance(chunk) = P(correct | with chunk) - P(correct | without chunk)
                  = baseline_accuracy - accuracy_when_chunk_removed

Where:

  • baseline_accuracy = 1.0 if base solution is correct, 0.0 if incorrect
  • accuracy_when_chunk_removed = fraction of rollouts that are correct

  • Positive importance = chunk helps (anchor candidate)
  • Negative importance = chunk actually hurts accuracy
  • Zero importance = chunk doesn't matter
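In code, under the definitions above (a minimal sketch; `importance` here is illustrative, not the repo's function):

```python
def importance(base_correct, rollouts_correct):
    """importance = baseline_accuracy - accuracy_when_chunk_removed.

    base_correct: whether the unmodified base solution is correct.
    rollouts_correct: list[bool], correctness of each continuation
    resampled with the chunk removed.
    """
    baseline = 1.0 if base_correct else 0.0
    removed_acc = sum(rollouts_correct) / len(rollouts_correct)
    return baseline - removed_acc
```

For a correct base solution where only 40 of 100 rollouts stay correct after removal, importance is 0.6: a strong anchor candidate.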


Results

Math Problems (40 problems, 6,473 chunks, 1,612 thought anchors)

| Strategy | Samples | Savings | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Paper (Brute Force) | 647,300 | 0% | 100% | 100% | 100% |
| Pure Adaptive | 332,620 | 48.6% | 100% | 99.9% | 99.9% |
| Pure Sparse | 276,400 | 57.3% | 97.9% | 94.0% | 95.9% |
| Hybrid | 137,680 | 78.7% | 98.0% | 94.1% | 96.0% |
| Evolved Hybrid | 56,750 | 91.2% | 97.7% | 87.1% | 92.1% |

Blackmail Scenarios (10 scenarios, 935 chunks)

We also tested on blackmail scenarios from the Thought Branches paper (Macar et al.), which analyzes LLM reasoning in ethical decision-making:

| Strategy | Samples | Savings | Recall | Precision | F1 |
|---|---|---|---|---|---|
| Paper (Brute Force) | 93,500 | 0% | 100% | 100% | 100% |
| Pure Adaptive | 78,440 | 16.1% | 100% | 100% | 100% |
| Pure Sparse | 38,600 | 58.7% | 99.4% | 97.4% | 98.4% |
| Hybrid | 28,020 | 70.0% | 99.4% | 96.9% | 98.1% |
| Evolved Hybrid | 13,598 | 85.5% | 92.1% | 97.1% | 94.3% |

Metrics note: For blackmail, NEGLIGIBLE chunks are the "positive" class—these are chunks that cause blackmail behavior (removing them increases safety). Conversely, IMPORTANT means removing the chunk reduces safety (the chunk was preventing blackmail). So Recall = "of chunks causing blackmail, how many did we identify?" This differs from math where IMPORTANT is the positive class. Savings are computed relative to actual rollouts available (not assumed 100 per chunk).

Note: Blackmail scenarios show lower adaptive savings (16.1% vs 48.6%) because the reasoning patterns are more complex—chunk impacts are often ambiguous, requiring more samples to reach statistical confidence.

Whistleblower Scenarios (20 scenarios, 1,040 chunks)

We also tested on whistleblower scenarios from the Thought Branches paper, analyzing LLM reasoning when deciding whether to autonomously report wrongdoing (e.g., reporting clinical trial data falsification to the FDA):

| Strategy | Samples | Savings | Agreement | Prec(IMP) | Rec(IMP) | F1(IMP) |
|---|---|---|---|---|---|---|
| Paper (Brute Force) | 103,634 | 0% | 100% | 100% | 100% | 100% |
| Pure Adaptive | 64,672 | 37.6% | 100% | 100% | 100% | 100% |
| Pure Sparse | 48,544 | 53.2% | 97.4% | 99.0% | 95.8% | 97.2% |
| Hybrid | 26,404 | 74.5% | 97.5% | 99.0% | 95.7% | 97.2% |
| Evolved Hybrid | 13,022 | 87.4% | 94.8% | 87.6% | 98.7% | 91.9% |

Metrics note: Whistleblower scenarios have no NEGLIGIBLE chunks—removing any chunk reduces the probability of whistleblowing behavior. Instead of precision/recall for NEGLIGIBLE, we measure IMPORTANT vs UNCERTAIN classification (where IMPORTANT = removing chunk significantly reduces whistleblowing, UNCERTAIN = effect is ambiguous). Agreement = % of chunks with same classification as paper baseline. Savings are computed relative to actual rollouts available (not assumed 100 per chunk).

Key observations:

  • The dataset splits 48.5% IMPORTANT / 51.5% UNCERTAIN / 0% NEGLIGIBLE
  • Evolved Hybrid achieves 98.7% recall on IMPORTANT chunks (rarely misses critical reasoning steps)
  • The lower precision (87.6%) reflects that with fewer samples, more chunks are conservatively classified as IMPORTANT rather than UNCERTAIN

Key Findings

  1. Pure Adaptive is the safe choice: On math, 48.6% savings with 100% recall/precision. On blackmail and whistleblower, 16-38% savings with 100% agreement (in our runs). Since it samples every chunk, it should theoretically match the paper baseline. The safest option when accuracy matters most.

  2. Hybrid achieves good savings: 78.7% on math, 70.0% on blackmail, 74.5% on whistleblower—consistent savings across domains while maintaining high recall (95%+).

  3. Evolved Hybrid maximizes savings: 91.2% on math, 85.5% on blackmail, 87.4% on whistleblower. Best for cost-sensitive applications where some precision loss is acceptable.

  4. Domain affects savings: Math problems have clearer "important vs not" chunks, enabling more aggressive early stopping. Safety scenarios (blackmail, whistleblower) have more nuanced reasoning patterns.

  5. Different classification challenges: Math/blackmail have NEGLIGIBLE chunks to find; whistleblower has none (all chunks reduce whistleblowing when removed). The strategies adapt to both types of problems.

  6. Results are robust to rollout ordering: We tested whether the order of rollouts within each chunk affects results by running all analyses with randomized rollout order (seed=42). Differences were minimal (<1% for most metrics), confirming no systematic ordering bias in the datasets. The --randomize flag is available in all analysis scripts.

Recommendation

| Use Case | Strategy | Math | Blackmail | Whistleblower | Why |
|---|---|---|---|---|---|
| Research (100% recall) | Pure Adaptive | 48.6% | 16.1% | 37.6% | Finds ALL anchors |
| Balanced | Hybrid | 78.7% | 70.0% | 74.5% | Good savings, high recall |
| Production (cost-sensitive) | Evolved Hybrid | 91.2% | 85.5% | 87.4% | Maximum savings |
| Quick exploration | Pure Sparse | 57.3% | 58.7% | 53.2% | Simple, no adaptive |

File Structure

frugal-thought-anchors/
├── README.md                      # This file
├── pyproject.toml                 # Project dependencies
├── run_analysis.py                # Main analysis script for math problems
├── run_blackmail_analysis.py      # Analysis script for blackmail scenarios
├── run_whistleblower_analysis.py  # Analysis script for whistleblower scenarios
│
├── sampling_strategies/           # Sampling strategy implementations
│   ├── __init__.py                # Package exports
│   ├── common.py                  # Shared types and utilities
│   ├── adaptive.py                # Pure adaptive sampling
│   ├── sparse.py                  # Pure sparse sampling
│   ├── hybrid.py                  # Hybrid (sparse + adaptive)
│   └── evolved_hybrid.py          # OpenEvolve-optimized algorithm
│
├── rollout_loader.py              # Load math rollouts from disk
├── rollout_loader_blackmail.py    # Load blackmail rollouts from disk
├── rollout_loader_whistleblower.py # Load whistleblower rollouts from disk
├── download_problems.py           # Download math rollouts from HuggingFace
├── download_blackmail.py          # Download blackmail rollouts from HuggingFace
├── download_whistleblower.py      # Download whistleblower rollouts from HuggingFace
│
├── openevolve/                    # OpenEvolve algorithm discovery (see openevolve/README.md)
│   └── ...
│
├── hf_math_rollouts/              # Downloaded math rollouts (created by download_problems.py)
├── hf_blackmail_rollouts/         # Downloaded blackmail rollouts (created by download_blackmail.py)
└── hf_whistleblower_rollouts/     # Downloaded whistleblower rollouts (created by download_whistleblower.py)

Usage Examples

Run the full analysis

uv run python run_analysis.py

Save results to JSON

uv run python run_analysis.py --save

Download rollouts

# Download the 40-problem evaluation subset used in this README
uv run python download_problems.py --problems 330 1591 2050 2137 2189 2236 2238 2870 3360 3448 3550 3916 3935 4019 4164 4605 4682 6481 6596 6998

# Download specific problems
uv run python download_problems.py --problems 4682 330 1591

# List available problems
uv run python download_problems.py --list-only

# Download *everything* available in the HF dataset (can be very large)
uv run python download_problems.py --all

Use individual strategies in Python

from rollout_loader import load_problem, MODEL_PATH
from sampling_strategies.adaptive import simulate_problem, OPTIMAL
from sampling_strategies.sparse import sparse_sample_problem, DEFAULT as SPARSE_DEFAULT
from sampling_strategies.hybrid import hybrid_sample_problem, HybridConfig
import sampling_strategies.evolved_hybrid as evolved_hybrid

# Load a problem
problem = load_problem(MODEL_PATH / "correct_base_solution" / "problem_4682")
baseline = 1.0 if problem.base_correct else 0.0

# Run pure adaptive
adaptive_result = simulate_problem(problem, OPTIMAL)
print(f"Adaptive: {adaptive_result.savings_percent:.1f}% savings")

# Run pure sparse
sparse_result = sparse_sample_problem(problem, SPARSE_DEFAULT)
print(f"Sparse: {sparse_result.savings_percent:.1f}% savings")

# Run hybrid with custom config
config = HybridConfig(
    sparse_step=3,
    adaptive_schedule=[10, 20, 30, 50, 70, 100],
    jump_threshold=0.10,
)
hybrid_result = hybrid_sample_problem(problem, config)
print(f"Hybrid: {hybrid_result.savings_percent:.1f}% savings")

# Run evolved hybrid (OpenEvolve-optimized)
evolved_result = evolved_hybrid.sample_problem(problem.chunk_rollouts, baseline)
total_samples = sum(s for _, s, _ in evolved_result.values())
print(f"Evolved: {total_samples} samples")

About

Efficient thought anchor detection for interpretability research
