
Enhance local AI review with full-file context, delta-diff re-review, and cost visibility#232

Open
igerber wants to merge 20 commits into main from local-review-updates

Conversation

@igerber
Owner

igerber commented Mar 22, 2026

Summary

  • Add full changed source file context to review prompt (catches "sins of omission" — code that should have changed but wasn't)
  • Add import-graph expansion for deep context mode (files imported by changed files included as read-only reference)
  • Add configurable --context minimal|standard|deep levels (default: standard)
  • Add delta-diff re-review mode — on subsequent runs, only changes since the last review are sent as primary context, with the full branch diff as reference
  • Add sticky finding tracking via review-state.json — findings are matched across rounds by severity + summary fingerprint, enabling structured status tracking (open/addressed)
  • Add parameter propagation (anti-pattern 4) and semantic contract violation (anti-pattern 5) checks to pr_review.md — shared between local and CI review
  • Add token cost visibility — estimated cost before API call, actual cost from API response
  • Add --include-files for selective extra context and --token-budget for controlling import-context inclusion
  • Add --base-ref and --commit-sha arguments for robust review state tracking
  • Add 37 new tests (41 → 78 total) covering all new functions including edge cases for finding parsing/matching
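The sticky finding tracking described above (matching by severity plus a summary fingerprint across rounds) can be sketched as follows. This is a minimal illustration, not the script's actual code; the function names and the dict shapes are assumptions:

```python
import hashlib

def finding_fingerprint(severity: str, summary: str, length: int = 50) -> str:
    # Hypothetical sketch: key a finding by severity plus a normalized
    # summary prefix, so whitespace/case and line-number churn between
    # review rounds do not change the key.
    normalized = " ".join(summary.lower().split())[:length]
    return f"{severity}:{hashlib.sha1(normalized.encode()).hexdigest()[:12]}"

def match_findings(previous: list[dict], current: list[dict]) -> list[dict]:
    # A previous finding stays "open" if a current finding shares its
    # fingerprint; otherwise it is marked "addressed".
    current_keys = {finding_fingerprint(f["severity"], f["summary"]) for f in current}
    merged = []
    for f in previous:
        key = finding_fingerprint(f["severity"], f["summary"])
        status = "open" if key in current_keys else "addressed"
        merged.append({**f, "status": status})
    return merged
```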

Methodology references

  • N/A — no methodology/estimator changes. All changes are confined to the local AI review tooling (openai_review.py, ai-review-local.md), review criteria (pr_review.md), and tests.

Validation

  • Tests added/updated: tests/test_openai_review.py (78 tests, all passing)
  • New test classes: TestResolveChangedSourceFiles, TestReadSourceFiles, TestParseImports, TestExpandImportGraph, TestEstimateCost, TestTokenBudget, TestParseReviewState, TestWriteReviewState, TestParseReviewFindings, TestMergeFindings, TestCompilePromptWithContext
  • Local AI review ran 3 rounds: Round 1 (⚠️ 2 P1s) → Round 2 (⚠️ 2 P1s) → Round 3 (✅ Looks good)
  • CI compatibility verified: pr_review.md changes are strictly additive (anti-patterns 4-5 appended), all existing anti-patterns intact, substitution drift test passes

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 4 commits March 22, 2026 10:40
… and cost visibility

Add 9 improvements to /ai-review-local: full changed file contents (standard
context), import-graph expansion (deep context), configurable --context levels,
delta-diff re-review mode, sticky finding tracking via review-state.json,
parameter propagation and semantic contract anti-patterns in pr_review.md,
token cost visibility, --include-files for selective context, and --token-budget
for controlling context size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…en budget

Address all P1/P2 findings from local AI review:

- P1: merge_findings() now matches by (severity, file_path, summary_fingerprint)
  instead of (location, severity) to prevent false "addressed" on line shifts
- P1: parse_review_findings() requires bold severity (**P1**) format, skips
  table rows, assessment lines, instructional prose, and multi-severity refs
- P2: Use list-per-key in merge to handle duplicate findings without overwrite
- P2: Refactor apply_token_budget() to take typed params; source files are
  always included (sticky); only import-context governed by budget
- P2: Add --base-ref argument instead of brittle branch_info parsing
- P2: Add 8 new tests covering parser rejection and merge edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- P1: Simplify _finding_key() to (severity, summary_fingerprint) — file path
  excluded from primary key to prevent false "addressed" when location shifts
  or disappears between review rounds
- P1: Fix parser regex to reliably match **Severity:** P1 format (4 capture
  groups covering **P1**, **Severity:** P1, **Severity: P1**, Severity: P1)
- P2: Fix _finding_key() docstring to match actual implementation
- P2: Fix test_matching_with_missing_location to assert correct behavior
  (no false addressed) instead of accepting the bug
- Add 3 positive parser format tests (bold, bold-label, plain-label)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⛔ Blocker

Executive Summary

  • The new full-file context modes create a real secret-leak path: the workflow now uploads full source/import/include files, but the pre-upload secret scan still inspects only the Git diff.
  • Sticky finding tracking is not reliable yet: the parser misses contract-conforming multi-line findings, and the merge logic can silently mislabel duplicate findings across files.
  • Review-state reuse ignores the stored branch/base metadata, so delta re-review can contaminate one branch with another branch’s findings and diff base.
  • Cost visibility is wrong for valid snapshot model IDs because overlapping prefixes are matched in the wrong order.

Methodology

No findings. The diff is limited to local review tooling, prompt text, and tests; it does not change any estimator, weighting, variance/SE, identification assumption, or default behavior under diff_diff/, so docs/methodology/REGISTRY.md is not implicated here.

Code Quality

Performance

Maintainability

Tech Debt

No findings. I did not find an existing TODO.md entry that mitigates the issues above, and this PR does not add one.

Security

  • Severity: P0. .claude/commands/ai-review-local.md:L132-L157, .claude/commands/ai-review-local.md:L388-L392, .claude/scripts/openai_review.py:L1166-L1207. The pre-upload secret scan still searches only the diff, but the new feature uploads full changed source files, import-expanded files, and arbitrary --include-files. Impact: unchanged secrets elsewhere in a changed file, or any secret in a user-specified include file, can be transmitted to OpenAI with no warning. Concrete fix: build the exact outbound payload first, scan every file/content block that will be uploaded (not just diff hunks), and block or require an explicit confirmation step on any secret/sensitive-file hit.
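The concrete fix proposed here (scan every block of the outbound payload, not just diff hunks) could look roughly like this. The pattern list is illustrative only; a real scanner would use a vetted rule set, and the function name is an assumption:

```python
import re

# Hypothetical patterns for illustration; not a complete secret rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key id
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),        # generic api key
]

def scan_outbound_payload(blocks: dict[str, str]) -> list[tuple[str, str]]:
    # Scan every content block that will be uploaded: the diff, full
    # changed source files, import-expanded files, and --include-files.
    hits = []
    for name, content in blocks.items():
        for pattern in SECRET_PATTERNS:
            if pattern.search(content):
                hits.append((name, pattern.pattern))
    return hits
```

A caller would build the exact payload first, run this over it, and block (or require explicit confirmation) on any hit.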

Documentation/Tests

I could not run pytest in this environment because pytest is not installed, so the review above is based on static inspection plus targeted python3 checks of the new helpers.

Path to Approval

  1. Scan the actual outbound review payload, not just the diff. That needs to cover full changed files, import-context files, --include-files, and any prior-review artifact that will be retransmitted.
  2. Rework finding parsing so the documented Severity / Impact / Concrete fix structure is parseable, and add a fail-safe that never auto-marks prior findings addressed when current parsing is ambiguous or empty.
  3. Tighten finding matching to use normalized file/location when available, with summary-only fallback only for unique candidates.
  4. Validate review-state.json against the current branch/base before generating a delta diff; if the stored state is for a different branch or incompatible history, discard it and run fresh.
  5. Fix cost estimation by matching the longest model prefix first and add regression tests for snapshot IDs such as gpt-4.1-mini-* and o3-mini-*.
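The longest-prefix-first lookup in item 5 is simple to sketch. The price values below are illustrative placeholders, not real rates, and the table/function names are assumptions:

```python
# Hypothetical price table (input, output) per 1M tokens; values illustrative.
PRICES = {
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "o3": (2.00, 8.00),
    "o3-mini": (1.10, 4.40),
}

def lookup_price(model: str) -> tuple[float, float]:
    # Try the longest known prefix first, so a snapshot ID such as
    # 'gpt-4.1-mini-2026-01-01' resolves to the mini tier rather than
    # matching the shorter 'gpt-4.1' parent prefix.
    for prefix in sorted(PRICES, key=len, reverse=True):
        if model.startswith(prefix):
            return PRICES[prefix]
    raise KeyError(f"unknown model: {model}")
```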

…g, pricing

P0: Expand secret scan (Step 3c) to cover full source files, import-expanded
files, and --include-files — not just diff hunks

P1: Rewrite parse_review_findings() as block-based parser supporting multi-line
Severity/Impact/Concrete-fix format. Add fail-safe: when parsing yields zero
findings but severity markers exist, preserve prior findings instead of marking
all as addressed. Return type changed to tuple[list[dict], bool].

P1: Implement tiered finding matching — primary key uses (severity, file_basename,
summary[:50]) when file path available; fallback uses (severity, summary[:50])
only for findings without file paths and only with unique candidates. Prevents
cross-file misattribution while still handling missing locations.

P1: Fix estimate_cost() to match longest model prefix first, so gpt-4.1-mini
and o3-mini snapshots get correct mini-tier pricing instead of parent.

P1: Add branch/base validation to review-state.json reuse in skill Step 4.
Discard stale state when stored branch or base_ref doesn't match current.

P2: In delta mode, derive source/import context from delta changed-files list
instead of full branch changed-files.

P2: Add 7 regression tests: multi-line parser, parse uncertainty flag,
same-summary-different-files, mini pricing, o3-mini pricing, delta context
derivation, branch/base validation support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f3518d19c6edf00d179dd28db3a5699dad1791b2


Overall Assessment

⚠️ Needs changes

The prior P0 secret-scan issue appears addressed, and the earlier cost-prefix bug also looks fixed, but the new re-review state machinery still has three unmitigated P1 correctness gaps that can silently misreport whether earlier findings were actually addressed.

Executive Summary

  • No methodology impact. The diff is confined to local review tooling, prompt text, and tests; nothing under diff_diff/ or docs/methodology/REGISTRY.md changes estimator math, weighting, variance/SE, assumptions, or defaults.
  • The previous full-file secret-scan gap appears addressed in the skill instructions, and the previous snapshot-pricing bug appears fixed in estimate_cost().
  • parse_review_findings() still misses valid multiline findings when the reviewer uses plain Severity: / Impact: labels, and that path still marks prior findings addressed.
  • merge_findings() still loses identity across rounds when locations are asymmetric or when two different files share the same basename.
  • The new --force-fresh flag is not actually propagated as a fresh run: previous review context and review-state can still leak into that path, and the re-review validity check still does not verify ancestry after rebases.
  • The added test breadth is good, but the suite still misses regressions for the remaining parser, matcher, and --force-fresh edge cases.

Methodology

  • No findings. Affected methods: none. The changed files are .claude/ tooling, .github/codex/prompts/pr_review.md, and tests/; there is no estimator, weighting, variance/SE, identification, or default-behavior change to cross-check against the Methodology Registry.

Code Quality

  • Severity: P1. Impact: parse_review_findings() still drops contract-conforming multiline findings when the model emits plain Severity: / Impact: labels instead of bold markdown. _IMPACT_PATTERN only matches **Impact:**, and the fail-safe only treats **P0**/**P1**-style markers as uncertain, so write_review_state() can still zero-parse the current review and mark all previous findings addressed. I verified locally that a block formatted as Severity: P1 + Impact: ... returns ([], False) and then flips the prior finding to addressed. Concrete fix: accept plain Impact: / Location: lines, and broaden the uncertainty detector to every supported severity syntax before merging state. Locations: .claude/scripts/openai_review.py:L406, .claude/scripts/openai_review.py:L503, .claude/scripts/openai_review.py:L545, .claude/scripts/openai_review.py:L1400.
  • Severity: P1. Impact: merge_findings() still misidentifies findings across rounds in two cases the PR intended to handle. First, fallback matching only runs when the current finding lacks a location, so a previous round with no location and a current round with one produces one open plus one addressed record for the same issue. Second, the primary key uses os.path.basename(...), so two distinct files with the same basename still cross-match; I verified this with diff_diff/__init__.py vs diff_diff/visualization/__init__.py. Concrete fix: key on normalized full relative path when available, not basename, and use the unique-candidate fallback symmetrically whenever either side is missing location data. Locations: .claude/scripts/openai_review.py:L555, .claude/scripts/openai_review.py:L595, .claude/scripts/openai_review.py:L610.

Performance

  • No findings. The earlier delta-context scoping issue looks fixed: source/import context now prefers delta changed files before full-branch changed files.

Maintainability

  • Severity: P1. Impact: the new --force-fresh flag is not actually wired through as a fresh review. The flag is documented as “run a full fresh review,” but Step 4 still preserves the previous review, Step 5 still always passes --review-state and can still pass --previous-review, and the state-validity gate is only git cat-file -t, which checks object existence, not ancestor compatibility after a same-branch rebase. That means a supposedly fresh run can still be biased by stale review text/state, and rebased branches can still generate the wrong delta scope as long as the old commit object remains in the repo. Concrete fix: on --force-fresh, explicitly delete or ignore both .claude/reviews/review-state.json and .claude/reviews/local-review-latest.md, omit --review-state / --previous-review on that path, and on normal re-reviews replace the existence check with git merge-base --is-ancestor "$last_reviewed_commit" HEAD or equivalent. Locations: .claude/commands/ai-review-local.md:L23, .claude/commands/ai-review-local.md:L206, .claude/commands/ai-review-local.md:L225, .claude/commands/ai-review-local.md:L248, .claude/commands/ai-review-local.md:L257, .claude/commands/ai-review-local.md:L372.

Tech Debt

  • No findings. I did not find a TODO.md entry that mitigates the open P1/P2 items above.

Security

  • No findings. The new Step 3c full-file scan appears to close the earlier leak path for changed source files, deep-context imports, and --include-files.

Documentation/Tests

  • Severity: P2. Impact: the new suite still misses the exact regressions above. It covers bold multiline finding blocks and the case where the current finding lacks a location, but not plain multiline Severity: / Impact: blocks, previous-missing-location matching, same-basename/different-directory collisions, or --force-fresh behavior. Those P1s can still pass green. Concrete fix: add one regression each in TestParseReviewFindings, TestMergeFindings, and a new test covering --force-fresh / re-review-state handling. Locations: tests/test_openai_review.py:L747, tests/test_openai_review.py:L871, tests/test_openai_review.py:L997.

Path to Approval

  1. Make parse_review_findings() parse plain multiline Severity: / Impact: / Location: blocks and treat any supported-but-unparsed severity syntax as parse_uncertain=True.
  2. Rework merge_findings() to use normalized full paths, not basenames, and allow unique fallback matching whenever either the previous or current finding lacks a location.
  3. Make --force-fresh actually fresh by skipping prior review/state inputs end-to-end, and validate normal re-review commits with an ancestor check before generating delta diffs.
  4. Add regression tests for the three scenarios above so the remaining P1s cannot pass with a green suite.

pytest is not installed in this environment, so I could not run the test file directly; the parser/matcher findings above were verified with targeted python3 reproductions against the changed helpers.

…esh, ancestry

P1: Accept plain Impact:/Location: labels in multiline blocks (not just bold).
Broaden uncertainty detector with non-anchored regex catching Severity: markers
anywhere in review text.

P1: Use full relative path in finding matching instead of basename to prevent
__init__.py collisions across directories. Make fallback matching symmetric:
when previous finding lacks location, reverse-match against current findings.

P1: Wire --force-fresh to actually delete all prior state (review-state.json,
local-review-latest.md, local-review-previous.md) and omit --review-state,
--previous-review, --delta-diff args. Replace git cat-file -t with
git merge-base --is-ancestor for proper rebase detection.

P2: Add 4 regression tests: plain multiline blocks, plain severity uncertainty,
previous-missing-location symmetric matching, same-basename-different-dirs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f4f2ff76b1dc5a27f6a256f05d5b846c1828f95b


Overall Assessment

⚠️ Needs changes

No methodology-facing code changed, but two unmitigated P1 workflow-correctness issues remain in the local review tooling.

Executive Summary

  • No estimator, weighting, variance/SE, identification, or default-behavior code changed. The diff is confined to local review tooling, prompt text, and tests, so the Methodology Registry is not implicated.
  • The earlier plain Severity:/Impact: parsing gap and the full-path/symmetric merge_findings() matching issue appear resolved.
  • One prior parser P1 is still only partially fixed: the zero-parse fail-safe still misses supported bold-label severity syntaxes, so prior findings can still be silently marked addressed.
  • --force-fresh itself is now fresh, and ancestor validation was added, but the stale-state fallback still reuses prior review text after invalidation, so a nominally fresh run can still be biased by unrelated old findings.
  • The expanded tests are useful, but they still miss a regression for the remaining parser blind spot.

Methodology

  • No findings. Affected methods: none. The changed files are reviewer tooling, prompt instructions, and tests only; nothing under diff_diff/ or docs/methodology/REGISTRY.md changes estimator math, weighting, variance/SE, assumptions, or defaults.

Code Quality

  • Severity: P1. Impact: parse_review_findings() still does not set parse_uncertain=True for all severity syntaxes that _BLOCK_START accepts. I reproduced parse_review_findings("- **Severity:** P1\n", 2) and parse_review_findings("- **Severity: P1**\n", 2) returning ([], False), which means the main flow will merge against an empty current set and silently mark prior findings addressed. Concrete fix: derive the uncertainty check from the same accepted severity grammar as _BLOCK_START (or explicitly add both bold-label forms), then add regressions for both zero-parse cases. Locations: .claude/scripts/openai_review.py:L406-L410, .claude/scripts/openai_review.py:L554-L564, .claude/scripts/openai_review.py:L1433-L1447

Performance

  • No findings.

Maintainability

  • Severity: P1. Impact: the stale-state “run fresh review” path is still not actually fresh. Step 4 deletes review-state.json when branch/base mismatch or merge-base --is-ancestor fails, but it still unconditionally copies .claude/reviews/local-review-latest.md to .claude/reviews/local-review-previous.md; Step 5 then still permits --previous-review whenever that file exists, and the script always treats that as follow-up-review context. That lets unrelated old findings leak into a supposedly fresh review after rebase/branch switch. Concrete fix: only preserve/pass --previous-review when the stored state was validated and delta files were generated; otherwise delete or ignore local-review-previous.md and omit --previous-review entirely. Locations: .claude/commands/ai-review-local.md:L221-L257, .claude/commands/ai-review-local.md:L262-L291, .claude/scripts/openai_review.py:L1249-L1261, .claude/scripts/openai_review.py:L878-L916

Tech Debt

  • No findings. TODO.md does not contain entries that mitigate the open P1s above.

Security

  • No findings. The new full-file secret scan is a net improvement over the prior diff-only scan.

Documentation/Tests

  • Severity: P2. Impact: the enlarged parser test coverage still does not lock down the remaining fail-safe bug above, so the false-addressed path can still pass green. Concrete fix: add explicit regressions asserting that parse_review_findings("- **Severity:** P1\n", round) and parse_review_findings("- **Severity: P1**\n", round) return ([], True). Location: tests/test_openai_review.py:L786-L853

Path to Approval

  1. Make the zero-parse uncertainty guard cover every supported severity syntax, including - **Severity:** P1 and - **Severity: P1**.
  2. Stop carrying forward local-review-latest.md as --previous-review when review-state was invalidated by branch/base mismatch or non-ancestor history.
  3. Add regressions for both cases so the parser/state bugs cannot silently return.
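Item 1 amounts to deriving the zero-parse guard from the same severity grammar the block parser accepts. A minimal sketch, assuming the four documented syntaxes (`**P1**`, `**Severity:** P1`, `**Severity: P1**`, `Severity: P1`); the regex and names are not the script's actual code:

```python
import re

_SEV = r"P[0-3]"
# Single source of truth for accepted severity syntaxes.
_BLOCK_START = re.compile(
    rf"(?:\*\*{_SEV}\*\*"
    rf"|\*\*Severity:\*\*\s*{_SEV}"
    rf"|\*\*Severity:\s*{_SEV}\*\*"
    rf"|Severity:\s*{_SEV})"
)

def parse_is_uncertain(review_text: str, findings: list) -> bool:
    # If the block parser extracted nothing but the text still contains
    # any supported severity marker, flag the parse as uncertain instead
    # of letting the merge mark all prior findings addressed.
    if findings:
        return False
    return any(_BLOCK_START.search(line) for line in review_text.splitlines())
```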

Verification note: pytest is not installed in this environment, so I could not run tests/test_openai_review.py; the findings above were validated with targeted python3 reproductions against the current script.

…ew gating

P1: Replace broad_sev regex with line-by-line _BLOCK_START scan in uncertainty
detector, so **Severity:** P1 and **Severity: P1** formats correctly trigger
parse_uncertain=True when the block parser can't extract findings.

P1: Move previous-review preservation inside the validated-state path in Step 4.
When state is invalidated (branch mismatch, non-ancestor, rebase), delete
local-review-previous.md to prevent stale findings leaking via --previous-review.

P2: Add 2 regression tests for bold-label uncertainty detection, fix 2 existing
test assertions for mid-line severity markers (correctly non-detected).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a354a2fbd56d87579d2c2fb79c531766ed5ed488


Overall Assessment

⚠️ Needs changes

The prior re-review P1s appear resolved, but one new P1 remains in the new deep-context import expansion path.

Executive Summary

  • No estimator, weighting, variance/SE, identification, or default-behavior code changed. docs/methodology/REGISTRY.md is not implicated by this diff.
  • The previous re-review findings look fixed: bold-label severity formats now trigger the parser fail-safe, and stale previous-review carryover after state invalidation is no longer preserved.
  • Severity P1: the new --context deep feature does not actually include imported submodule files for package code. The import parser collapses diff_diff.foo.bar to diff_diff.foo, so deep context can include the wrong file or no file at all.
  • I reproduced expand_import_graph([diff_diff/visualization/__init__.py]) == [], even though that file imports _continuous, _diagnostic, _event_study, _power, _staggered, and _synthetic.
  • The new tests only cover top-level imports and generic existence checks, so this regression currently passes.
  • pytest is not installed in this environment; findings were validated with targeted python3 reproductions against .claude/scripts/openai_review.py.

Methodology

  • No findings. I checked the scope against docs/methodology/REGISTRY.md; the PR changes only review tooling, prompt text, and tests, not estimator logic or documented methodology deviations.

Code Quality

  • Severity: P1. Impact: parse_imports() truncates every diff_diff.* import to diff_diff.<second_component>, and expand_import_graph() then resolves that truncated name. For real package code this silently drops the actual imported module. Example: diff_diff/visualization/__init__.py imports _continuous, _diagnostic, _event_study, _power, _staggered, and _synthetic, but the parser returns only diff_diff.visualization; because that resolves back to the changed __init__.py, deep expansion becomes empty. Likewise diff_diff/visualization/_staggered.py imports diff_diff.visualization._common, but deep mode includes diff_diff/visualization/__init__.py instead of _common.py. That defeats the advertised “files imported by changed files” behavior and can cause local re-reviews to miss omission bugs in sibling modules. Concrete fix: preserve the full imported module path in parse_imports() for both ast.Import and ast.ImportFrom, and for from . import foo append each alias name to the resolved base before calling resolve_module_to_path(). Locations: .claude/scripts/openai_review.py:L216-L266, .claude/scripts/openai_review.py:L327-L344, diff_diff/visualization/__init__.py:L10-L40, diff_diff/visualization/_staggered.py:L108-L163.

Performance

  • No findings.

Maintainability

  • No findings beyond the P1 import-resolution defect above.

Tech Debt

  • No findings. TODO.md does not currently track the open import-graph defect, so it is not mitigated.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new test coverage does not lock down package/submodule import expansion. TestParseImports and TestExpandImportGraph only exercise top-level imports via bacon.py and basic existence properties, so the deep-context regression above passes unnoticed. Concrete fix: add regressions that assert expand_import_graph() on diff_diff/visualization/__init__.py includes the visualization submodules, and that _staggered.py expansion includes _common.py rather than visualization/__init__.py. Locations: tests/test_openai_review.py:L529-L587.

Path to Approval

  1. Fix parse_imports() so it keeps full submodule names instead of truncating to diff_diff.<second_component>, including alias-only relative imports like from . import foo.
  2. Add regression tests covering real package imports, especially diff_diff/visualization/__init__.py and diff_diff/visualization/_staggered.py, so deep-context expansion must include _event_study.py, _power.py, _staggered.py, _common.py, etc.
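The fix in item 1 can be sketched with Python's ast module. This is an illustration under assumptions, not the script's actual implementation: the function name is hypothetical, and `package` is assumed to be the dotted package of the file being parsed (derived by the caller from the file path).

```python
import ast

def parse_imports(source: str, package: str = "") -> set[str]:
    # Keep the full dotted module path instead of truncating it, and for
    # relative imports append each alias to the resolved base package.
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name)  # full path, e.g. pkg.sub._mod
        elif isinstance(node, ast.ImportFrom):
            if node.level:  # relative: resolve against the current package
                parts = package.split(".") if package else []
                base = ".".join(parts[: len(parts) - node.level + 1])
                if node.module:
                    base = f"{base}.{node.module}" if base else node.module
            else:
                base = node.module or ""
            for alias in node.names:
                # 'from . import foo' -> base + alias; a caller can then
                # test which candidate names resolve to real files.
                modules.add(f"{base}.{alias.name}" if base else alias.name)
    return modules
```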

P1: parse_imports() no longer truncates to 2 components. Full module paths
like diff_diff.visualization._common are preserved. For 'from . import foo'
style relative imports, each alias is appended to the resolved base package
(e.g., diff_diff.visualization._event_study instead of diff_diff.visualization).

P2: Add 3 regression tests: submodule import truncation check,
relative import alias expansion, visualization __init__.py expansion
includes submodule files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6f49201648f26fb7195643e88cecfd3a55b18fad


Overall Assessment

⚠️ Needs changes

No estimator, weighting, variance/SE, identification, or default-behavior code changed, so the methodology registry is not implicated here. The prior deep-context import-resolution defect appears fixed, but the new sticky finding-tracking path still has two unmitigated P1 correctness issues.

Executive Summary

  • No methodology findings. I checked the diff scope against docs/methodology/REGISTRY.md; this PR only changes local review tooling, prompt text, and tests.
  • The previous deep-context P1 looks resolved: parse_imports() now keeps full submodule paths, and expand_import_graph() resolves real visualization submodules correctly. openai_review.py:L216 openai_review.py:L326
  • Severity: P1. Sticky finding tracking is not stable for the common single-line finding format; a pure line-number shift turns one persistent finding into “addressed + new open.”
  • Severity: P1. When the parser hits its own parse_uncertain fallback, the script still advances the stored commit/round, so later delta reviews can skip code that was never successfully re-parsed.
  • The new tests do not cover either failure mode end-to-end. pytest is not installed in this environment; the findings below were validated with targeted python3 reproductions against the script.

Methodology

No findings. The changed files are tooling and prompt/test assets, not estimator implementations, so there is nothing to cross-check against the academic-method sections of docs/methodology/REGISTRY.md.

Code Quality

  • Severity: P1. Impact: parse_review_findings() keeps inline locations like foo.py:L10 inside the summary it extracts from single-line findings, and _finding_keys() fingerprints that raw summary. I reproduced `**P1** Missing NaN guard in foo.py:L10` followed by the same finding at L12; merge_findings() marked the old one addressed and the new one open instead of matching them as the same issue. That breaks the advertised “sticky finding tracking” on routine line shifts. Concrete fix: normalize summaries before matching by stripping inline `path.py:Lnn` spans/backticks, or compute a separate location-free fingerprint field and match on that. Add an end-to-end regression that parses both review texts before merging. Locations: openai_review.py:L516 openai_review.py:L576 openai_review.py:L611

  • Severity: P1. Impact: when parse_review_findings() returns parse_uncertain=True, the script preserves the previous findings but still writes a new review-state.json with the current commit_sha and incremented review_round. That moves the delta baseline forward even though the review output was not successfully parsed, so the next re-review can treat unresolved changes as already reviewed unless the user manually falls back to --force-fresh. Concrete fix: if parse_uncertain fires, leave the prior review state untouched, or write an explicit parse_failed marker and suppress delta mode on the next run instead of advancing last_reviewed_commit. Locations: openai_review.py:L1438 openai_review.py:L1449 tests/test_openai_review.py:L890

Performance

No findings.

Maintainability

No findings beyond the two code-quality defects above.

Tech Debt

No findings. TODO.md does not currently track the open P1s above, so they are not mitigated for assessment purposes.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new test suite does not exercise the real parse->merge pipeline for single-line findings, and it does not verify the state-machine behavior when parse_uncertain fires. test_matching_with_shifted_line_numbers bypasses the parser by feeding already-normalized summaries into merge_findings(), so the current churn bug passes unnoticed. Concrete fix: add regressions that (1) parse two single-line reviews differing only by line number and assert one open finding after merge, and (2) assert that the parse_uncertain path does not advance the persisted commit anchor. Locations: tests/test_openai_review.py:L824 tests/test_openai_review.py:L890 tests/test_openai_review.py:L986

pytest is not installed here, so I could not run tests/test_openai_review.py. The P1s above were confirmed with targeted python3 reproductions against openai_review.py.

Path to Approval

  1. Normalize single-line finding summaries before fingerprinting so inline file.py:Lnn text does not cause false “addressed/new” churn.
  2. Change the parse_uncertain path so it does not advance last_reviewed_commit or review_round when the current review output could not be parsed reliably.
  3. Add end-to-end regressions for both flows: single-line line-shift matching and parse_uncertain state preservation.
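The normalization in item 1 can be sketched as a pre-fingerprint cleanup step. The regex and function name are assumptions for illustration; it presumes, as the later commit message notes, that summaries are lowercased before matching:

```python
import re

# Strips inline `file.py:Lnn` (optionally a range) plus surrounding backticks.
_INLINE_LOC = re.compile(r"`?[\w./-]+\.py:l\d+(?:-l?\d+)?`?")

def normalize_summary(summary: str) -> str:
    # Remove inline location references so a pure line-number shift does
    # not change a finding's fingerprint between review rounds.
    cleaned = _INLINE_LOC.sub("", summary.lower())
    return " ".join(cleaned.split())
```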

P1: Strip inline file.py:Lnn references from summaries before fingerprinting
in _finding_keys(). Uses lowercase regex since summaries are already lowercased.
Prevents false addressed+open churn when the same finding shifts line numbers.

P1: Skip write_review_state() entirely when parse_uncertain fires. This
prevents advancing the delta baseline past unparsed code, so the next
re-review correctly covers the unreviewed changes.

P2: Add 2 end-to-end regression tests: parse-then-merge pipeline verifying
line-shift matching produces 1 open / 0 addressed, and parse_uncertain
state preservation verifying the state file is not modified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6fc32375de393f05badd7f6d8530d1abf5d8d4e2


Overall Assessment

⚠️ Needs changes

No estimator, weighting, variance/SE, identification, or default-behavior code changed, so I found no methodology issues. One prior P1 is fixed, but one prior P1 remains partially open, and there is one additional P1 in the new re-review workflow.

Executive Summary

  • No methodology findings. I checked docs/methodology/REGISTRY.md; the diff only touches local review tooling, prompt text, and tests.
  • The prior line-shift churn defect is resolved: _finding_keys() now strips inline file.py:Lnn text before matching, and the new parse->merge regression covers that path (.claude/scripts/openai_review.py:L568-L588, tests/test_openai_review.py:L1175-L1188).
  • The prior parse_uncertain finding is only partially fixed. The script now preserves state when prior structured findings exist, but a parse failure after a clean round still advances the review baseline.
  • New P1: the branch/base mismatch path in the command workflow deletes review-state.json but can still leave stale local-review-previous.md in place, and Step 5 still says to pass --previous-review whenever that file exists.
  • pytest is not installed in this environment, so I validated the parser/merge behavior with targeted python3 reproductions instead of the full test file.

Methodology

  • No findings. The changed files are review tooling and review-prompt assets, not estimator implementations, so the Methodology Registry and paper references are not implicated.

Code Quality

  • Severity: P1. Impact: the old parse-failure baseline bug is only partially fixed. In main(), state writes are skipped only when parse_uncertain is true and structured_findings is truthy; after a previous clean round or any state file with zero findings, the same parse failure still falls through to the unconditional write branch and advances last_reviewed_commit, so the next delta review can skip code that was never successfully re-parsed. Location: .claude/scripts/openai_review.py:L1438-L1469. Concrete fix: make parse_uncertain suppress all state writes regardless of whether prior findings are empty, or persist a parse_failed marker that disables delta mode on the next run.

  • Severity: P1. Impact: the branch/base mismatch path discards review-state.json but does not delete .claude/reviews/local-review-previous.md, even though the next step still says to pass --previous-review whenever that file exists. If a user switches branches after a rerun, stale findings from another branch/base can still be injected into what is supposed to be a fresh review, biasing the model’s assessment. Location: .claude/commands/ai-review-local.md:L227-L231, .claude/commands/ai-review-local.md:L260-L286. Concrete fix: delete local-review-previous.md in the branch/base mismatch block and/or gate --previous-review on an explicit “delta mode active” flag instead of raw file existence.

  • Severity: P2. Impact: the new cost-visibility feature underreports spend for the default model because PRICING["gpt-5.4"] is hardcoded as $2.00 input / $8.00 output per 1M tokens, while the current official GPT-5.4 pricing is $2.50 / $15.00, and gpt-5.4 is also the default model. Location: .claude/scripts/openai_review.py:L752-L757, .claude/scripts/openai_review.py:L991-L994. Concrete fix: update the pricing table and add a regression that asserts the configured default model’s pricing entry.
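The cost arithmetic behind this finding can be sketched as follows, using the rates quoted in the finding as assumed per-1M-token figures (the `PRICING` table shape and `estimate_cost` helper are illustrative, not the script's actual code):

```python
# Assumed per-1M-token rates, taken from the review finding above.
PRICING = {"gpt-5.4": {"input": 2.50, "output": 15.00}}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of one API call."""
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```

At these rates, a 1M-input / 100k-output review costs $4.00; the stale $2.00/$8.00 table would report $2.80 for the same call, which is the underreporting the finding describes.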

Performance

  • No findings.

Maintainability

  • No findings beyond the code-quality issues above.

Tech Debt

  • No findings. TODO.md does not currently track the open issues above, so they are unmitigated for assessment purposes.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new regression around parse-failure state preservation only checks that a temp file’s mtime stays unchanged after calling parse_review_findings() directly; it never exercises main() and therefore does not cover the still-open structured_findings is None branch. There is also no workflow regression covering branch/base invalidation while a stale local-review-previous.md exists. Location: tests/test_openai_review.py:L1190-L1218. Concrete fix: add an end-to-end state-write regression for parse_uncertain=True with zero prior findings, and add a workflow regression that invalidates branch/base state while a stale previous-review file exists and asserts that no --previous-review context is passed.

Verification note: python3 -m pytest tests/test_openai_review.py -q could not run here because pytest is not installed.

Path to Approval

  1. In .claude/scripts/openai_review.py, change the parse_uncertain path so no review-state.json update occurs on any parse failure, including prior-clean rounds / empty finding sets.
  2. In .claude/commands/ai-review-local.md, clear or ignore .claude/reviews/local-review-previous.md whenever branch/base validation fails, and only pass --previous-review after state validation has actually activated delta mode.
  3. Add regressions for both edge cases: parse failure after a prior zero-finding round, and branch/base mismatch with a stale previous-review file present.

…cing

P1: parse_uncertain now unconditionally skips all write_review_state() calls
regardless of whether prior findings exist. Prevents advancing delta baseline
after any parse failure, not just when structured_findings is truthy.

P1: Branch/base mismatch block now also deletes local-review-previous.md to
prevent stale findings from leaking via --previous-review after branch switch.

P2: Update gpt-5.4 pricing to $2.50/$15.00 per 1M tokens (was $2.00/$8.00).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 68c035241ed79c67990349c1bfda5df78caed953


Overall Assessment

⚠️ Needs changes

No estimator, weighting, variance/SE, identification, or default-behavior methodology changed in this diff, so I found no registry/paper adherence issues. The two prior P1s are addressed, but one unmitigated P1 remains in the new sticky-finding logic.

Executive Summary

  • No methodology findings. I checked docs/methodology/REGISTRY.md; the diff is confined to review tooling, prompt criteria, and tests.
  • The prior parse_uncertain baseline bug is fixed: state writes are now skipped on any parse failure, not only when prior findings exist, in openai_review.py#L1438.
  • The prior stale-previous-review leak is fixed: branch/base invalidation now deletes local-review-previous.md, and Step 5 only reuses it if that file still exists, in ai-review-local.md#L228 and ai-review-local.md#L269.
  • Severity P1 [Newly identified]: the line-shift churn fix is still .py-only, so findings on .md and other non-Python files can still flip from open to false addressed after line-number movement.
  • Severity P2: --token-budget undercounts rerun-only prompt sections, so deep-context import files can still be included after the actual rerun prompt has already exceeded the requested budget.
  • python3 -m pytest tests/test_openai_review.py -q could not run here because pytest is not installed; I validated the open issues with targeted python3 reproductions against the module.

Methodology

No findings. The changed files do not alter any estimator implementation or any documented methodology deviation in docs/methodology/REGISTRY.md.

Code Quality

  • Severity: P1 [Newly identified]. Impact: the sticky-finding matcher still only understands *.py locations. _LOCATION_PATTERN only extracts Python paths, and _finding_keys() only strips inline *.py:Lnn references from summaries. For findings on .md files like this PR’s ai-review-local.md, a pure line-number shift can still create one new open finding and one false addressed finding, which pollutes review-state.json and biases later re-reviews. I reproduced this by parsing/merging two identical .md findings that differed only by line number. Location: _LOCATION_PATTERN, location extraction loop, _finding_keys(), and the current regression only covering .py shifts in tests/test_openai_review.py#L1176. Concrete fix: generalize location parsing and summary normalization to repo-relative files generically, not just .py, and add a regression proving .md:L10 and .md:L20 merge as a single open finding.

Performance

  • Severity: P2. Impact: rerun-mode budget enforcement is understated because mandatory_est omits both the structured findings table and delta_changed_files_text, even though compile_prompt() appends them in delta mode. That means --token-budget can materially undercount rerun prompts and retain import-context blocks that should have been dropped, undermining the PR’s new cost-control feature. Location: mandatory_est calculation, structured findings table, and delta changed-files section. Concrete fix: compute the budget from a precompiled prompt without import-context, or explicitly add the structured-findings and delta-changed-files text to mandatory_est before calling apply_token_budget().

Maintainability

No findings.

Tech Debt

No findings. TODO.md#L44 does not currently track the open issues above, so they remain unmitigated for assessment purposes.

Security

No findings.

Documentation/Tests

No standalone findings beyond the missing regressions implied above. The current additions cover Python-file line shifts and basic token-budget behavior, but not non-.py finding churn or delta-mode budget accounting, in tests/test_openai_review.py#L656 and tests/test_openai_review.py#L1176.

Path to Approval

  1. Broaden finding parsing/fingerprinting so non-.py locations are normalized the same way as .py locations, and add a regression that proves a .md line shift preserves a single open finding across reruns.

P1: Broaden _LOCATION_PATTERN and summary normalization in _finding_keys()
from .py-only to any file extension (\.\w+). Findings on .md, .yml, etc.
now correctly merge across line shifts without false addressed/open churn.

P2: Include structured_findings table and delta_changed_files_text in
mandatory_est for token budget calculation, so deep-context import files
are correctly dropped when rerun prompts exceed the budget.
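The budget accounting this fix describes can be sketched like so; the function names and the rough 4-characters-per-token estimate are assumptions for illustration:

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    # Crude heuristic; real tooling would typically use a tokenizer.
    return len(text) // chars_per_token

def apply_token_budget(mandatory_texts, optional_blocks, budget):
    """Count every mandatory prompt section first (including delta-mode
    extras such as the structured-findings table), then admit optional
    import-context blocks only while they still fit the budget."""
    used = sum(estimate_tokens(text) for text in mandatory_texts)
    kept = []
    for name, text in optional_blocks:
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept.append(name)
            used += cost
    return kept, used
```

Omitting a mandatory section from the first sum (the original bug) makes `used` too small, so optional import files survive that should have been dropped.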

Add regression test for .md file line-shift merge stability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8f10ff0f98ad158356928bb00482d8f39fe5b195


Overall Assessment

Blocker — one unmitigated P0 security issue remains, along with two P1 workflow/robustness issues.

Executive Summary

  • No estimator, weighting, variance/SE, identification, or default-behavior methodology changed in this diff, so I found no REGISTRY or paper-adherence defects.
  • The prior sticky-finding line-shift issue for non-.py files appears addressed; the matcher now handles generic .<ext> locations and there is a .md regression test.
  • The prior delta-budget undercount is materially improved; mandatory_est now includes delta changed-files text and structured findings.
  • Severity P0: the new --include-files path handling is not confined to the repo root, so a malicious or mistaken invocation can upload arbitrary local files to OpenAI.
  • Severity P1: --force-fresh deletes prior review state and then suppresses --review-state, so the fresh run never becomes the new delta baseline.
  • Severity P1: malformed review-state.json is still not handled end-to-end; the skill and script both assume happy-path JSON structure and can abort reruns instead of starting fresh.

Methodology

No findings. I checked docs/methodology/REGISTRY.md; this PR only changes local review tooling, prompt criteria, and tests, not any estimator implementation or documented methodology deviation.

Code Quality

  • Severity: P1. Impact: --force-fresh currently wipes prior state and then tells Step 5 not to pass --review-state/--commit-sha, so the “fresh” review cannot write a new baseline. The next rerun therefore falls back to another full review instead of using delta-diff + sticky findings, which breaks the new feature’s expected parameter interaction. Location: .claude/commands/ai-review-local.md:L210-L217, .claude/commands/ai-review-local.md:L268-L298. Concrete fix: on --force-fresh, still pass --review-state, --commit-sha, and --base-ref so the current fresh review seeds a new baseline; only suppress --previous-review and delta arguments.
  • Severity: P1. Impact: malformed review-state is not actually handled as “start fresh.” The skill reads .claude/reviews/review-state.json with raw json.load(...).get(...) shell snippets, and the script’s parse_review_state() only checks JSON syntax/schema version, not root/object types. A truncated file, non-dict root, or non-list findings payload can therefore abort the rerun path before or during merge_findings() instead of degrading cleanly. Location: .claude/commands/ai-review-local.md:L221-L225, .claude/scripts/openai_review.py:L353-L379, .claude/scripts/openai_review.py:L1459-L1467. Concrete fix: centralize state parsing/validation in Python, require dict + list[dict] + integer review_round, catch JSON/type errors, and have the skill stop shell-parsing JSON directly.

Performance

No unmitigated findings. The earlier rerun token-budget omission looks addressed by the added delta_changed_files_text and structured_findings terms in .claude/scripts/openai_review.py:L1360-L1367.

Maintainability

No separate findings beyond the state-parsing duplication already called out under Code Quality.

Tech Debt

No findings. TODO.md does not currently track the open issues above, so they are not mitigated by deferred-work tracking. Location: TODO.md:L44-L76.

Security

  • Severity: P0. Impact: --include-files is documented as repo-relative context, but the implementation accepts any path containing / and feeds it through os.path.join(args.repo_root, name). That allows both parent traversal (../...) and absolute paths (/etc/passwd), and the resulting files are then read into the prompt and sent to OpenAI. The Step 3c secret scan repeats the same permissive resolution and only does pattern-based scanning, which does not protect arbitrary sensitive files. This is a direct local-file exfiltration path. Location: .claude/scripts/openai_review.py:L1311-L1325, .claude/commands/ai-review-local.md:L175-L180. Concrete fix: reject absolute paths, normalize candidates with realpath, require them to stay under the repo root, and apply the same containment check in the Step 3c scan logic before any file is read or uploaded.

Documentation/Tests

  • Severity: P2. Impact: the new test expansion is broad, but it still misses regressions for the three highest-risk paths introduced here: repo-root confinement for --include-files, malformed review-state.json fallback, and --force-fresh baseline persistence. That leaves the bugs above unguarded. Location: tests/test_openai_review.py:L656-L1232. Concrete fix: add targeted unit/integration tests for /etc/passwd/../.env rejection, malformed/non-dict/non-list review-state handling, and the “force fresh, then rerun” delta-baseline flow.
  • I could not run python3 -m pytest tests/test_openai_review.py -q here because pytest is not installed in this environment.

Path to Approval

  1. Constrain --include-files to repo-root descendants only in both the script and the Step 3c scan; reject absolute paths and .. traversal, and add regressions for /etc/passwd and ../.env.
  2. Change the --force-fresh workflow so it clears stale state but still passes --review-state, --commit-sha, and --base-ref, then add a regression proving the next rerun uses delta mode from that fresh baseline.
  3. Harden review-state parsing end-to-end: replace the raw shell json.load(...).get(...) reads with a safe helper, validate review-state.json structure in parse_review_state(), and add malformed-JSON / non-dict-root / non-list-findings regressions.

P0: Confine --include-files to repo root. Reject absolute paths upfront,
normalize with realpath, verify candidate stays under repo_root + os.sep.
Prevents ../ traversal and arbitrary local file exfiltration to OpenAI.
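A sketch of that containment check, under assumed helper and error-handling names (the actual script's interface may differ):

```python
import os

def resolve_include_file(repo_root: str, name: str) -> str:
    """Reject absolute paths up front, then verify the realpath-normalized
    candidate still lives under the repo root before it is ever read."""
    if os.path.isabs(name):
        raise ValueError(f"absolute path rejected: {name}")
    root = os.path.realpath(repo_root)
    candidate = os.path.realpath(os.path.join(root, name))
    # The os.sep suffix prevents /repo-evil from passing a /repo check.
    if not candidate.startswith(root + os.sep):
        raise ValueError(f"path escapes repo root: {name}")
    return candidate
```

Normalizing both sides with realpath also defeats symlink-based escapes, which a plain string prefix check on the raw input would miss.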

P1: --force-fresh now still passes --review-state/--commit-sha/--base-ref
so the fresh run seeds a new delta baseline. Only --previous-review and
delta args are suppressed.

P1: Harden parse_review_state() with type validation: non-dict root,
non-list findings, non-int review_round all degrade to ([], 0) instead
of aborting. Skill-side JSON reads wrapped in try/except.
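A sketch of the hardened parser, with `([], 0)` as the degraded fresh-review result; the function shape is assumed from the description above:

```python
import json

def parse_review_state(path: str):
    """Degrade to an empty state on any structural problem instead of
    letting malformed JSON crash downstream merge/compile code."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return [], 0
    if not isinstance(data, dict):
        return [], 0
    findings = data.get("findings")
    review_round = data.get("review_round")
    if not isinstance(findings, list) or not isinstance(review_round, int):
        return [], 0
    # Drop non-dict entries so later f.get(...) calls cannot raise.
    return [f for f in findings if isinstance(f, dict)], review_round
```

A missing file, truncated JSON, non-dict root, or string entry in `findings` all fall back to a fresh review rather than aborting the rerun.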

P2: Add 5 regression tests: path traversal detection, non-dict root,
non-list findings, non-int round, absolute path rejection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e3585ebe7b790051f82a72e5492fefc147b6a22c


Overall Assessment

⚠️ Needs changes

Executive Summary

  • No estimator, weighting, variance/SE, identification, or default-behavior code changed in this diff. I checked docs/methodology/REGISTRY.md and found no methodology defects.
  • The prior --include-files repo-root confinement P0 appears fixed in the script at .claude/scripts/openai_review.py:1331.
  • The prior --force-fresh baseline-seeding P1 appears fixed in the skill instructions at .claude/commands/ai-review-local.md:210 and .claude/commands/ai-review-local.md:290.
  • The prior malformed review-state.json P1 is only partially fixed. State validation is still not end-to-end: the skill can still reuse incompatible state to drive delta mode, and the script still accepts findings as any list, so malformed entries later crash on f.get(...).
  • [Newly identified] Sticky finding tracking still has a correctness gap: multiple prior location-less findings with the same severity/summary can collapse into one current match, silently losing one status record.
  • I could not run pytest; this environment does not have pytest installed.

Methodology

  • No findings. This PR only changes local review tooling, prompt criteria, and tests, not estimator implementations or documented methodology deviations in docs/methodology/REGISTRY.md.

Code Quality

  • Severity P1. Impact: the prior malformed-state issue remains unresolved end-to-end. Step 4 still shell-parses .claude/reviews/review-state.json and ignores schema validation, so a schema-mismatched state file can still drive delta-only review instead of a true fresh review; Step 5 also gates --previous-review on file existence alone, so stale .claude/reviews/local-review-previous.md can leak into a fresh run. In the script, parse_review_state() only checks that findings is a list, but later code assumes every element is a dict and calls f.get(...); a payload like {"schema_version": 1, "findings": ["oops"]} passes parsing and then crashes in compile_prompt() / merge_findings(). Locations: .claude/commands/ai-review-local.md:222, .claude/commands/ai-review-local.md:291, .claude/scripts/openai_review.py:387, .claude/scripts/openai_review.py:906, .claude/scripts/openai_review.py:624. Concrete fix: centralize review-state loading/validation in Python before delta generation, require findings to be list[dict] with minimally valid finding objects, and only pass --previous-review when validated delta mode is active.
  • Severity P2. Impact: [Newly identified] merge_findings() is not one-to-one when previous findings lack locations. In pass 2b, every unmatched previous no-location finding is consumed whenever there is exactly one current fallback candidate, so two prior findings with the same severity/summary can both match one current finding. I reproduced this locally with two prior no-location P1 findings sharing the same summary across different sections; the function returned only the current open finding and dropped the addressed record entirely. Location: .claude/scripts/openai_review.py:671. Concrete fix: include section in the fallback key and/or track consumed current candidates in pass 2b so one current finding can match at most one previous finding.

Performance

  • No findings. The earlier delta-budget undercount looks addressed; mandatory token estimation now includes delta diff, delta changed-files, and structured findings at .claude/scripts/openai_review.py:1388.

Maintainability

  • No separate findings beyond the duplicated state-validation logic already called out under Code Quality.

Tech Debt

  • No findings. The remaining issues are not tracked under “Tech Debt from Code Reviews” in TODO.md:44, so they are not mitigated by deferred-work tracking.

Security

  • No unmitigated findings. The prior local-file exfiltration issue appears fixed by the realpath + repo-root containment check in .claude/scripts/openai_review.py:1331. I did not find a remaining P0 in the current diff.

Documentation/Tests

  • Severity P2. Impact: the new tests do not cover the remaining failure modes. The malformed-state tests stop at top-level type checks and never exercise non-dict findings elements, schema-mismatched state flowing through Step 4, or stale local-review-previous.md contamination when review-state.json is absent. The include-files “absolute path” test also does not assert script behavior; it only asserts that /etc/passwd is absolute. Locations: tests/test_openai_review.py:714, tests/test_openai_review.py:1267, tests/test_openai_review.py:974. Concrete fix: add regressions for findings=["oops"], old-schema state being rejected before delta-mode setup, stale previous-review file suppression on fresh runs, and duplicate no-location finding matching.

Path to Approval

  1. Move review-state validation out of the shell snippets in Step 4 and into a single Python validation path that returns either “validated delta mode” or “fresh review”; reject schema-mismatched/invalid state before any delta diff or previous-review args are prepared.
  2. Tighten parse_review_state() so findings must be list[dict] before compile_prompt() or merge_findings() ever touch it, and treat invalid element types as “start fresh.”
  3. Change Step 5 so --previous-review is keyed off an explicit validated-delta flag, not just the presence of .claude/reviews/local-review-previous.md.
  4. Make fallback matching in merge_findings() one-to-one, or include section in the fallback key, and add a regression with two prior no-location findings sharing the same summary.
  5. Add tests for the invalid-state and duplicate-finding cases above; I could not run the suite here because pytest is not installed.

P1: Filter non-dict elements from findings list in parse_review_state().
Entries like strings or integers are silently dropped instead of crashing
downstream code that calls f.get(...).

P2: Track consumed current candidates in merge_findings() pass 2b so
two prior no-location findings can't both match the same current finding.
Ensures one-to-one matching in the reverse fallback pass.
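The one-to-one fallback pass can be sketched like this (function and field names are assumptions; the real merge operates on richer finding records):

```python
def match_no_location_findings(previous, current):
    """Reverse fallback pass: each current finding can be claimed by at
    most one previous no-location finding with the same severity+summary."""
    consumed = set()
    matched = {}
    for prev in previous:
        key = (prev["severity"], prev["summary"])
        for idx, cur in enumerate(current):
            if idx in consumed:
                continue  # one-to-one: already claimed by an earlier prev
            if (cur["severity"], cur["summary"]) == key:
                consumed.add(idx)
                matched[prev["id"]] = idx
                break
    return matched
```

Previous findings whose ids are absent from the result remain unmatched and would surface as "addressed"; without the `consumed` set, two duplicate prior findings could both claim the same current one and the addressed record would vanish.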

P2: Add 2 regression tests: non-dict findings filtering, duplicate
no-location findings one-to-one matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 2df9a88795c8be08ba56120cffcb74361579f8d3


Overall Assessment

⚠️ Needs changes

Executive Summary

  • No estimator, weighting, variance/SE, identification, or default-behavior code changed. I checked docs/methodology/REGISTRY.md and found no methodology defects in scope.
  • The prior --include-files repo-root confinement issue appears fixed in .claude/scripts/openai_review.py:L1341-L1378.
  • The prior duplicate no-location matching issue appears fixed by the one-to-one fallback logic in .claude/scripts/openai_review.py:L674-L699, with regression coverage in tests/test_openai_review.py:L1087-L1107.
  • One prior P1 remains open: review-state validation is still not end-to-end. The skill can still enter delta mode from an incompatible state file, and the parser still accepts malformed dict-shaped findings that can silently corrupt sticky finding tracking.
  • I could not run pytest; this environment does not have it installed.

Methodology

No findings. The diff is confined to local review tooling, prompt criteria, and tests, and does not touch estimator implementations or any documented methodology deviations in docs/methodology/REGISTRY.md.

Code Quality

  • Severity P1. Impact: review-state.json validation is still only partial. In the skill, Step 4 reuses state based on branch/base/ancestor checks alone, so a schema-mismatched file can still generate delta files and preserve .claude/reviews/local-review-previous.md, even though the Python parser treats that same file as “start fresh”; Step 5 then keys --previous-review off file existence instead of an explicit validated-delta flag. Separately, parse_review_state() still accepts arbitrary dict-shaped findings, and merge_findings() consumes findings by id with "" as the fallback key, so malformed schema-version-1 state can silently lose one prior finding during matching. Locations: .claude/commands/ai-review-local.md:L220-L286, .claude/commands/ai-review-local.md:L290-L317, .claude/commands/ai-review-local.md:L419-L421, .claude/scripts/openai_review.py:L353-L402, .claude/scripts/openai_review.py:L614-L706. Concrete fix: centralize state validation in Python and return an explicit “validated delta mode” decision only when schema_version, branch/base, commit ancestry, and required finding fields (id, severity, summary, status) all pass; gate both delta generation and --previous-review on that decision, otherwise ignore/delete stale prior-review artifacts and run fresh.

Performance

No findings.

Maintainability

No separate findings beyond the duplicated/partial state-validation flow called out above.

Tech Debt

No mitigating TODO entry exists for the remaining state-validation issue under TODO.md:L44-L75, so it remains unmitigated.

Security

No unmitigated findings. The prior local-file inclusion issue appears fixed by the realpath plus repo-root containment check in .claude/scripts/openai_review.py:L1341-L1378.

Documentation/Tests

  • Severity P2. Impact: the new tests do not pin down the remaining stale-state paths. There is coverage for top-level schema mismatch and non-dict filtering, but no regression that a schema-mismatched state suppresses delta/--previous-review, and no test that dict findings missing required keys are rejected wholesale instead of being merged. The --include-files confinement tests also still assert path properties rather than exercising main()’s rejection branch. Locations: tests/test_openai_review.py:L733-L782, tests/test_openai_review.py:L988-L1164, tests/test_openai_review.py:L1303-L1318. Concrete fix: add CLI-level or helper-level tests that assert fresh-mode fallback on schema mismatch, reject malformed finding dicts missing required keys, and execute the actual --include-files validation path.

Path to Approval

  1. Add a single Python validator for review-state.json that checks schema version, branch/base, commit ancestry, and required finding fields before any delta files or previous-review artifacts are reused.
  2. Change Step 5 so --previous-review, --delta-diff, and --delta-changed-files are passed only when that validator explicitly says delta mode is safe.
  3. Tighten parse_review_state() to reject malformed finding dicts rather than filtering only by isinstance(f, dict).
  4. Add regressions for the two cases above. I could not run the suite here because pytest is not installed.

… validation

P1: Require id/severity/summary/status keys in finding dicts — malformed
entries are filtered in parse_review_state() before reaching merge/compile.

P1: Add validate_review_state() function that checks schema version,
branch/base match, and required fields in a single call. Returns
(findings, round, commit, is_valid) tuple.

P1: Replace shell JSON parsing in skill Step 4 with Python validator call.
Delta mode and --previous-review are now gated on the is_valid flag from
the centralized validator, not raw file existence.
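A sketch of that centralized validator, assuming the tuple shape described in the commit message (exact field names are illustrative):

```python
REQUIRED_FINDING_KEYS = {"id", "severity", "summary", "status"}

def validate_review_state(state, branch, base_ref, schema_version=1):
    """Single-call validation; returns (findings, round, commit, is_valid).
    is_valid=False means the caller must ignore stale artifacts and
    run a fresh review instead of delta mode."""
    invalid = ([], 0, None, False)
    if not isinstance(state, dict) or state.get("schema_version") != schema_version:
        return invalid
    if state.get("branch") != branch or state.get("base_ref") != base_ref:
        return invalid
    findings = state.get("findings")
    if not isinstance(findings, list) or not all(
        isinstance(f, dict) and REQUIRED_FINDING_KEYS <= f.keys() for f in findings
    ):
        return invalid
    return (findings, state.get("review_round", 0),
            state.get("last_reviewed_commit"), True)
```

Gating both delta generation and --previous-review on the returned flag, rather than on raw file existence, is the workflow change the review asks for.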

P2: Add 5 regression tests: missing-key filtering, validate_review_state
valid/branch-mismatch/schema-mismatch/missing-file cases. Fix existing
tests to include all required finding keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

1 similar comment

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c92f14f3c929a666d642da53628df43f30b420f3


Overall Assessment

⚠️ Needs changes

Executive Summary

  • No estimator, weighting, variance/SE, identification, or default-behavior code changed. I checked docs/methodology/REGISTRY.md and found no methodology defect in scope.
  • The prior --include-files repo-root confinement issue appears fixed by the absolute-path rejection plus realpath containment check in .claude/scripts/openai_review.py:L1389-L1427.
  • The prior duplicate no-location matching issue appears fixed by the one-to-one fallback logic in .claude/scripts/openai_review.py:L699-L747, with targeted regressions in tests/test_openai_review.py:L1108-L1147.
  • One prior P1 remains open: re-review state still is not fully fail-closed. Fresh runs can still inherit stale previous-review text, and malformed stored findings are still filtered instead of invalidating delta mode.
  • Coverage improved substantially, but the remaining stale-state paths are still untested. I could not run pytest here because pytest is not installed.

Methodology

No findings. The diff is confined to local review tooling, prompt criteria, and tests; it does not modify estimator implementations, weighting, variance/SE code, identification checks, or defaults in diff_diff/.

Code Quality

  • Severity P1. Impact: fresh reviews can still be contaminated by stale prior-review text. Step 4 only validates/rebuilds state when .claude/reviews/review-state.json exists, but Step 5 says to pass --previous-review whenever .claude/reviews/local-review-previous.md exists. If the JSON state is missing but the previous-review file lingers from an earlier run, the next review is still framed as a follow-up review instead of a fresh one. Locations: .claude/commands/ai-review-local.md:L223-L305. Concrete fix: introduce an explicit delta_mode_validated flag that is set only after state validation and ancestor checks succeed; delete or ignore .claude/reviews/local-review-previous.md whenever that flag is false, and gate --previous-review, --delta-diff, and --delta-changed-files on that flag only.
  • Severity P1. Impact: validate_review_state() still does not actually fail closed on malformed finding dicts. parse_review_state() filters out entries missing id, severity, summary, or status, but validate_review_state() then returns True as long as schema/branch/base/commit pass, so malformed findings silently disappear from sticky tracking while delta mode still activates. Locations: .claude/scripts/openai_review.py:L394-L450. Concrete fix: if any element of findings is non-dict or missing required keys, treat the entire state as invalid for delta mode and run fresh.

Performance

No findings.

Maintainability

No separate findings beyond the state-validation/gating issues above.

Tech Debt

No mitigating TODO.md entry applies. The open re-review-state issue is not tracked under TODO.md:L44-L75, and TODO tracking would not mitigate a P1 in any case.

Security

No unmitigated findings. The prior local-file inclusion concern appears fixed by the path confinement logic in .claude/scripts/openai_review.py:L1389-L1427.

Documentation/Tests

  • Severity P2. Impact: the new tests do not pin down the remaining stale-state failure modes. TestValidateReviewState covers valid/branch/schema/missing-file cases but not malformed-finding invalidation, and TestIncludeFilesConfinement never exercises the actual --include-files rejection branch. There is also no regression for the stale .claude/reviews/local-review-previous.md plus missing review-state.json path. Locations: tests/test_openai_review.py:L788-L804, tests/test_openai_review.py:L1324-L1386. Concrete fix: add tests that assert validate_review_state() returns False when any stored finding lacks required keys, suppress --previous-review when review-state.json is missing or invalid, and drive the real main() include-files warning path through the parser/helper logic.

Path to Approval

  1. Add an explicit validated-delta flag in .claude/commands/ai-review-local.md and use it as the sole gate for --previous-review, --delta-diff, and --delta-changed-files; clear local-review-previous.md whenever delta mode is not active, including when review-state.json is absent.
  2. Make validate_review_state() fail closed on any malformed findings entry instead of returning True after filtering.
  3. Add regressions for malformed-finding invalidation, stale previous-review suppression on fresh runs, and the actual --include-files rejection branch.

…ing test

P1: validate_review_state() now fails closed on ANY malformed finding —
validates raw findings for required keys BEFORE returning is_valid=True.
Malformed entries invalidate delta mode entirely instead of being silently
filtered.

P1: Step 4 now deletes local-review-previous.md whenever delta mode is NOT
active (including when review-state.json is missing entirely). Prevents
stale previous-review from leaking into fresh reviews.

P2: Add regression test asserting malformed findings return is_valid=False.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d8a520ee4b5b02dc99f4257a2b5581ad66ec77f4


Overall assessment

⚠️ Needs changes

Executive summary

  • No estimator, weighting, variance/SE, identification, or default-behavior code changed. The diff is confined to local review tooling, prompt text, and tests, so I found no methodology defect in scope after checking docs/methodology/REGISTRY.md.
  • The two prior P1s from the last review look resolved: stale previous-review leakage is now gated out in .claude/commands/ai-review-local.md:L223 and malformed stored findings now disable delta mode in .claude/scripts/openai_review.py:L409.
  • One new P1 was found in the sticky finding-tracking path: common Markdown list formats like 1. Severity: P1 or * **Severity:** P1 are treated as “no findings” with parse_uncertain=False, so prior open findings get marked addressed and the review baseline advances incorrectly.
  • The new tests are much better than before, but they still miss the parser formats above and still do not exercise the real --include-files rejection path.
  • I could not run pytest here because pytest is not installed in this environment.

Methodology

No findings. Methods affected: none. The PR does not touch diff_diff/ estimator implementations, docstring references, weighting, variance/SE code, identification checks, or defaults, so there is no REGISTRY-linked methodology change to cross-check.

Code Quality

  • Severity P1. Impact: sticky re-review tracking is not fail-closed for common Markdown list markers. _BLOCK_START only recognizes bare or --prefixed severity lines, so outputs such as 1. Severity: P1 and * **Severity:** P1 produce findings=[] and parse_uncertain=False; main() then merges that empty result and marks previously open findings addressed, advancing review-state.json even though the issue is still present. I reproduced this directly against the new helpers. Locations: .claude/scripts/openai_review.py:L492, .claude/scripts/openai_review.py:L531, .claude/scripts/openai_review.py:L1556. Concrete fix: expand block-start parsing and uncertainty detection to accept common Markdown list prefixes (*, +, 1./numbered lists), or otherwise fail closed whenever severity-bearing lines remain unparsed so state is not advanced on ambiguous output.

Performance

No findings.

Maintainability

No separate findings beyond the parser/state-coupling issue above.

Tech Debt

No mitigating TODO.md entry applies. The remaining parser/state issue is not tracked under TODO.md:L44, and in any case a P1 would not be mitigated by TODO tracking.

Security

  • Severity P2. Impact: the new full-file secret-scan step says it scans only files “that would be uploaded,” but its --include-files handling does not mirror the script’s absolute-path and repo-root confinement. A user-supplied absolute or traversal path can still be read by the local scan even though the script later skips it, which is a privacy/correctness mismatch in the pre-upload check. Location: .claude/commands/ai-review-local.md:L175. Concrete fix: apply the same absolute-path rejection plus realpath repo-root containment used in .claude/scripts/openai_review.py:L1405 before appending --include-files entries to upload_scan_files.

Documentation/Tests

  • Severity P2. Impact: the parser tests cover bare and --prefixed severity formats, but not numbered or *-prefixed Markdown lists, which is exactly the gap behind the P1 above. The --include-files confinement tests also remain logic-only checks rather than exercising the actual rejection branch. Locations: tests/test_openai_review.py:L879, tests/test_openai_review.py:L1393. Concrete fix: add regressions for 1. Severity: P1 and * **Severity:** P1, plus an end-to-end parse/merge assertion that unresolved findings stay open, and drive the real --include-files rejection path through main() or a factored helper.

Path to Approval

  1. Update parse_review_findings() so common Markdown list formats (*, +, numbered lists) are either parsed correctly or treated as parse-uncertain.
  2. Add a regression proving that a still-open prior finding is not marked addressed when the new review uses one of those formats, and that review-state.json is not advanced on ambiguous output.

P1: _BLOCK_START now accepts *, +, and numbered list prefixes (1., 2.)
in addition to - bullets. Prevents parse_uncertain=False on common
Markdown list formats like "1. Severity: P1" or "* **Severity:** P1".

P2: Step 3c --include-files scan now mirrors the script's absolute-path
rejection and realpath repo-root containment check before scanning.

P2: Add 3 regression tests: numbered-list severity parsing, starred-bold
severity parsing, numbered-bold uncertainty detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: aab90629b00b785d8b7d60775eb63d2bc2251710


Overall Assessment

⚠️ Needs changes

Executive Summary

  • No estimator, weighting, variance/SE, identification, or default-behavior code changed. I checked docs/methodology/REGISTRY.md and found no in-scope methodology defect.
  • The previously reported re-review fixes are largely in place: stale previous-review leakage is gated, malformed stored findings disable delta mode, and 1. Severity: P1 / * **Severity:** P1 now parse correctly.
  • One P1 remains in the sticky finding-tracking path: mixed-format outputs still do not fail closed. If a review mixes a supported finding with an unsupported severity form like - **P1:** ..., the parser returns uncertain=False, merge_findings() marks the prior finding addressed, and review-state.json advances incorrectly.
  • A residual P2 remains in the full-file secret-scan instructions: Step 3c still does not exactly mirror the script’s --include-files resolution for slash-containing relative paths, so it can scan files that the script would not upload.
  • Tests improved materially, but they still miss the mixed supported/unsupported parser case and the real --include-files rejection path. I could not run pytest here because pytest is not installed in this environment.

Methodology

  • No findings. Methods affected: none. The diff is confined to local review tooling, prompt text, and tests; it does not modify estimator implementations in diff_diff/, weighting, variance/SE, identification checks, or defaults.

Code Quality

  • Severity P1. Impact: sticky re-review tracking is still not fail-closed on partial parses. _BLOCK_START only recognizes **P1** and Severity:-labeled variants, so common Markdown forms such as - **P1:** ... or - P1: ... are ignored. Because parse_uncertain is only set when zero findings are parsed, a mixed output can parse one supported finding, return uncertain=False, and then merge_findings() will mark an omitted prior finding addressed; main() will write that incorrect state and advance the baseline. I reproduced this locally with **P2** ... plus - **P1:** Missing NaN guard ..., which parsed only the P2 and marked the old P1 addressed. Locations: [.claude/scripts/openai_review.py:L495-L655](/home/runner/work/diff-diff/diff-diff/.claude/scripts/openai_review.py#L495), [.claude/scripts/openai_review.py:L681-L775](/home/runner/work/diff-diff/diff-diff/.claude/scripts/openai_review.py#L681), [.claude/scripts/openai_review.py:L1560-L1592](/home/runner/work/diff-diff/diff-diff/.claude/scripts/openai_review.py#L1560). Concrete fix: either add support for P1: bullet forms in _BLOCK_START, or make parsing fail closed whenever any non-skipped severity-like line remains unparsed after block extraction so review-state.json is not advanced on partial parses.

Performance

  • No findings.

Maintainability

  • No separate findings beyond the parser/state-coupling issue above.

Tech Debt

  • No findings. I checked TODO.md; there is no existing tracked item that mitigates the remaining parser/state issue, and in any case the P1 above would not be downgraded by TODO tracking.

Security

  • Severity P2. Impact: the new full-file secret scan still does not exactly mirror the script’s --include-files resolution semantics. In Step 3c, slash-containing inputs are first probed under diff_diff/, so visualization/_common.py gets scanned as diff_diff/visualization/_common.py; the script later treats the same value as repo-root-relative and skips it as not found. That means the skill can still scan a file that is not actually “uploaded to OpenAI,” which is a smaller residual form of the previous privacy/correctness mismatch. Locations: [.claude/commands/ai-review-local.md:L175-L188](/home/runner/work/diff-diff/diff-diff/.claude/commands/ai-review-local.md#L175), [.claude/scripts/openai_review.py:L1408-L1449](/home/runner/work/diff-diff/diff-diff/.claude/scripts/openai_review.py#L1408). Concrete fix: make Step 3c use the exact same branching as the script: only bare filenames should resolve under diff_diff/; any input containing / should resolve only relative to the repo root unless the user explicitly writes diff_diff/....
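The absolute-path rejection plus realpath repo-root containment referenced in this finding can be sketched like this. The function name `confine_include_file` is hypothetical; the real logic lives in openai_review.py around the cited lines and may handle the diff_diff/ probing differently.

```python
import os
import tempfile

# Hypothetical sketch of --include-files confinement: reject absolute
# paths outright, then resolve the rest and require the result to stay
# under the repo root (realpath defeats ".." and symlink escapes).
def confine_include_file(repo_root, user_path):
    if os.path.isabs(user_path):
        return None  # absolute paths are rejected outright
    root = os.path.realpath(repo_root)
    resolved = os.path.realpath(os.path.join(root, user_path))
    if os.path.commonpath([resolved, root]) != root:
        return None  # escaped the repo root
    return resolved

root = tempfile.mkdtemp()
assert confine_include_file(root, "/etc/passwd") is None
assert confine_include_file(root, "../escape.py") is None
assert confine_include_file(root, "src/a.py") is not None
```

Mirroring this exact branching in the Step 3c scan would keep the pre-upload secret check aligned with what the script actually uploads.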

Documentation/Tests

  • Severity P2. Impact: the parser tests cover the newly supported Severity: forms and the total-failure parse_uncertain path, but they still miss the mixed supported/unsupported case that reproduces the P1 above, and the --include-files confinement tests still do not execute the real code path. Locations: [tests/test_openai_review.py:L885-L984](/home/runner/work/diff-diff/diff-diff/tests/test_openai_review.py#L885), [tests/test_openai_review.py:L1308-L1335](/home/runner/work/diff-diff/diff-diff/tests/test_openai_review.py#L1308), [tests/test_openai_review.py:L1410-L1425](/home/runner/work/diff-diff/diff-diff/tests/test_openai_review.py#L1410). Concrete fix: add a regression where one finding uses a supported format and another uses - **P1:** ... or - P1: ..., then assert the prior finding remains open and state is not advanced; also factor the include-file resolution into a helper or exercise main() so the actual rejection/selection branch is tested.

Path to Approval

  1. Make parse_review_findings() fail closed on partial parses, not just total parse failure. A concrete acceptable fix is to treat unparsed severity-like lines such as - **P1:** ... and - P1: ... as parse_uncertain=True, or to parse those forms directly.
  2. Add a regression that feeds a previous open P1 plus a new review containing one supported finding and one unsupported severity bullet, and assert that the previous finding is not marked addressed and review-state.json is not advanced.

…tion

P1: Add bold-colon (**P1:**) and bare-colon (P1:) severity formats to
_BLOCK_START. These are common review output patterns that were previously
missed, causing mixed-format reviews to falsely mark prior findings addressed.

P1: parse_uncertain now detects partial parses (some findings parsed but
severity lines remain unparsed) by comparing severity-line count vs findings
count. Prevents state advancement on mixed-format output.

P2: Step 3c --include-files now mirrors script's exact branching: only bare
filenames resolve under diff_diff/; slash-containing paths resolve repo-root
relative only.

P2: Add 3 regression tests: bold-colon format, bare-colon format, and
mixed-format review where both formats should parse.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review


@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b8912dce6c4721d3a7257e281f9b47f497b40a41


Overall assessment

⚠️ Needs changes

Executive summary

  • No estimator, weighting, variance/SE, identification-assumption, or default-behavior code changed. I checked docs/methodology/REGISTRY.md; methods affected: none.
  • The re-review hardening is improved: stale-state validation, branch/base gating, and the previously flagged - **P1:** / - P1: parsing cases are addressed.
  • One P1 remains in the sticky finding-tracking path: parse_review_findings() still does not fail closed on plausible severity blocks such as Severity: **P1**, so re-reviews can silently mark prior findings addressed and advance review-state.json incorrectly.
  • Test coverage improved materially, but it still misses that concrete parser regression and still does not exercise the real --include-files resolution path.
  • I could not run pytest here because pytest is not installed in this environment.

Methodology

No findings. The diff is confined to local review tooling, prompt text, and tests; no estimator implementation or methodology-facing behavior in diff_diff/ changed.

Code Quality

  • Severity P1. Impact: the prior sticky-state parser bug is only partially fixed. _BLOCK_START still does not recognize Severity: **P1** / - Severity: **P1**, and parse_uncertain only counts lines matched by _BLOCK_START. In a mixed-format review, the unsupported severity line is swallowed into the previous block, the following Impact: / Location: lines are misattributed, uncertain=False, and merge_findings() can mark the prior finding addressed even though it is still present. Direct reproduction against the current code yields exactly that behavior. Locations: [.claude/scripts/openai_review.py:L495-L660](/home/runner/work/diff-diff/diff-diff/.claude/scripts/openai_review.py#L495), [.claude/scripts/openai_review.py:L686-L780](/home/runner/work/diff-diff/diff-diff/.claude/scripts/openai_review.py#L686), [.claude/scripts/openai_review.py:L1563-L1597](/home/runner/work/diff-diff/diff-diff/.claude/scripts/openai_review.py#L1563). Concrete fix: either parse Severity: **P0-P3** explicitly, or treat any line beginning with Severity: as a block boundary and set parse_uncertain=True if it cannot be fully parsed before merging/writing state.

Performance

No findings.

Maintainability

No separate findings beyond the parser/state-coupling issue above.

Tech Debt

No findings. TODO.md does not contain an existing tracked item that mitigates the P1 parser/state bug, so it remains unmitigated.

Security

No findings. The earlier slash-path --include-files confinement mismatch appears corrected in both the skill instructions and the script path-resolution branch.

Documentation/Tests

  • Severity P2. Impact: the new tests still miss the actual failing parser shape and still do not hit the real include-file code path. Parser coverage includes Severity: P1, **Severity:** P1, and **Severity: P1**, but not Severity: **P1**, which is the concrete unresolved case above. The --include-files tests only assert conceptual path properties and never invoke main() or a dedicated helper, so they cannot catch drift in the real resolution logic. Locations: [tests/test_openai_review.py:L912-L1011](/home/runner/work/diff-diff/diff-diff/tests/test_openai_review.py#L912), [tests/test_openai_review.py:L1335-L1363](/home/runner/work/diff-diff/diff-diff/tests/test_openai_review.py#L1335), [tests/test_openai_review.py:L1437-L1452](/home/runner/work/diff-diff/diff-diff/tests/test_openai_review.py#L1437). Concrete fix: add a regression with mixed supported output plus Severity: **P1** asserting either correct two-finding parsing or parse_uncertain=True, and add an end-to-end state-merge assertion that the prior P1 remains open; factor include-file resolution into a helper and test accepted/rejected paths through the real branch.

Path to Approval

  1. Make parse_review_findings() fail closed for Severity: lines whose severity value is bolded, including mixed-format outputs where a supported finding precedes Severity: **P1**.
  2. Add regression coverage for that exact case and verify that review-state.json is not advanced and prior findings are not marked addressed when the parser cannot fully and unambiguously extract the findings.

P1: The plain Severity: pattern now accepts optional ** around the severity
value via \*{0,2} quantifier. Handles Severity: P1, Severity: **P1**,
and Severity: `P1` uniformly in a single regex branch.
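The single-branch quantifier fix described in this commit message can be sketched as below. This is an assumed shape, not the exact pattern in openai_review.py.

```python
import re

# Sketch: one regex branch accepting an optional ** or ` wrapper
# (up to two chars) on each side of the severity value.
SEV = re.compile(r"Severity:\s*[*`]{0,2}(P[0-3])[*`]{0,2}")

for text in ["Severity: P1", "Severity: **P1**", "Severity: `P1`"]:
    assert SEV.search(text).group(1) == "P1"
```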

P2: Add regression test for Severity: **P1** format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: cbff58188a0bd5c8ec9a05afee011c936c69f644


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

No findings. I checked docs/methodology/REGISTRY.md; this PR does not touch methodology-facing library code.

Code Quality

  • Severity P1. Impact: _should_skip_line() suppresses any severity line containing marker substrings such as Path to Approval, Needs changes, Looks good, Blocker, or the skip phrases in _SKIP_PHRASES. That means a real finding like **P2** The prompt omits the Path to Approval section ... parses as ([], False), and on rerun merge_findings() will mark the prior finding addressed even though it is still present. This is the same sticky-state failure mode as the prior parser bug, just via the new skip filter. Locations: .claude/scripts/openai_review.py:L512-L531, .claude/scripts/openai_review.py:L536-L660, .claude/scripts/openai_review.py:L1563-L1597. Concrete fix: replace substring-based skipping with anchored patterns for actual assessment headers/table rows/instruction lines only, or treat any skipped severity line as parse_uncertain=True so state cannot advance.
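The anchored-pattern alternative proposed in the concrete fix can be sketched as follows. Names here are hypothetical; the point is the contrast between substring containment and a full-line anchored match.

```python
import re

# Sketch: skip only lines that ARE assessment headers, anchored with
# ^...$, instead of any line that merely CONTAINS a marker phrase.
_ASSESSMENT_HEADER = re.compile(r"^\s*(Path to Approval|Overall Assessment)\s*$")

def should_skip(line):
    return bool(_ASSESSMENT_HEADER.match(line))

assert should_skip("Path to Approval")
# A real finding mentioning the phrase is NOT suppressed:
assert not should_skip("**P2** The prompt omits the Path to Approval section")
```

With substring matching, the second line would have been silently dropped and the prior finding marked addressed on rerun.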

Performance

  • Severity P2. Impact: in delta mode, mandatory_est always includes changed_files_text and then additionally includes delta_changed_files_text, but the compiled delta prompt only emits the delta changed-files block. On large long-lived branches this can drop import-context earlier than intended and inflate the displayed estimate/cost for --dry-run. Locations: .claude/scripts/openai_review.py:L1025-L1037, .claude/scripts/openai_review.py:L1472-L1496. Concrete fix: derive mandatory_est from the same branch of compile_prompt() that will actually be emitted; when delta_diff_text is present, count delta_changed_files_text instead of full-branch changed_files_text.

Maintainability

No findings.

Tech Debt

No separate findings. TODO.md does not currently track either open issue, so neither is mitigated. See TODO.md:L44-L76.

Security

No findings. The new --include-files confinement in .claude/scripts/openai_review.py:L1413-L1458 is correctly rooted with realpath plus repo-prefix checks, and the command doc adds full-file secret-scan guidance in .claude/commands/ai-review-local.md:L159-L211.

Documentation/Tests

  • Severity P2. Impact: the parser tests only cover skip-only prose and supported severity formats, not legitimate findings whose summaries contain the new skip markers, and the --include-files tests still do not execute the real resolution path in main(). That leaves the P1 above and the delta-budget bug unguarded. Locations: tests/test_openai_review.py:L847-L1055, tests/test_openai_review.py:L1444-L1459. Concrete fix: add an end-to-end rerun test that feeds a finding like **P2** The prompt omits the Path to Approval section... through parse/merge/state writing and asserts it stays open; add a delta-mode budget test where full-branch changed_files_text is large but delta changed-files are small and import-context should still be retained.
  • I could not run pytest here because pytest is not installed in this environment.

Path to Approval

  1. Narrow _should_skip_line() so it only skips actual assessment/table/instruction lines, not arbitrary findings whose summaries mention phrases like Path to Approval, Needs changes, or Looks good.
  2. Add a regression that reruns parsing plus merge/state update on one of those summaries and verifies the prior finding is not silently marked addressed.

P1: Remove _should_skip_line() from block-start matching. If a line matches
_BLOCK_START, it's a finding — trust the severity pattern regardless of
summary content. Skip heuristics remain in the uncertainty scanner only.
Prevents findings containing "Path to Approval", "Looks good", etc. in
their summaries from being silently dropped.

P2: Fix delta-mode token budget to not count changed_files_text (which is
NOT included in delta-mode prompt, only in fresh reviews). Prevents
unnecessary import-context dropping in re-reviews.

P2: Add 2 regression tests for findings with skip markers in summaries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4c1f0ff6216b0f73157a95e0eed4465fcf93e615


Overall Assessment

⚠️ Needs changes

The prior parser and delta-budget issues look fixed, but there is still an unmitigated P1 in the new sticky-finding merge logic.

Executive Summary

  • Methods affected: none. I checked docs/methodology/REGISTRY.md:L1-L40; this PR only changes local review tooling, prompt text, and tests.
  • The previous parser/skip-line finding appears resolved: severity block detection is broader now, and skip heuristics only participate in the uncertainty scan, not block collection, in .claude/scripts/openai_review.py:L495-L661.
  • The previous delta-budget overcount also appears resolved: delta mode no longer adds full-branch changed_files_text to mandatory_est in .claude/scripts/openai_review.py:L1472-L1491.
  • Severity P1: merge_findings() matches on severity + file + only the first 50 characters of the normalized summary, so distinct same-file findings can collapse into one state entry in .claude/scripts/openai_review.py:L664-L781.
  • Severity P2: the structured previous-findings table is emitted without escaping | or newline characters, so summaries like str | None can corrupt the table the model sees on rerun in .claude/scripts/openai_review.py:L989-L1006.
  • Severity P2: the new tests still miss regressions for those two cases, and I could not run pytest here because pytest is not installed.

Methodology

No findings. Methods affected: none. The diff is confined to review tooling/docs/tests rather than estimator, weighting, variance/SE, identification, or default-behavior code, so there is no methodology deviation to cross-check against docs/methodology/REGISTRY.md:L1-L40.

Code Quality

  • Severity P1. Impact: _finding_keys() truncates the normalized summary to 50 characters before merge_findings() compares prior and current findings. Two different findings in the same file with the same severity and same first 50 summary characters are therefore treated as the same issue; I reproduced this with two foo.py P1 summaries differing only at character 51, and merge_findings() returned a single open finding. That silently drops still-open issues from review-state.json and undermines the documented structured status-tracking flow in .claude/scripts/openai_review.py:L664-L781, .claude/commands/ai-review-local.md:L400-L403, and .claude/commands/ai-review-local.md:L472-L474. Concrete fix: use the full normalized summary, or a stable hash of it, in the primary key and reserve looser matching only for explicitly unique fallback cases.
  • Severity P2. Impact: compile_prompt() writes summary and location directly into a Markdown table row, so unescaped | or newline characters split columns and corrupt the previous-findings block fed back to the model in delta mode in .claude/scripts/openai_review.py:L989-L1006. Concrete fix: escape table metacharacters/newlines before rendering, or switch this block to a safer format such as bullets or JSON.
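The collision-resistant keying proposed for the first finding can be sketched like this. The helper name `finding_key` and the normalization regex are assumptions; the real `_finding_keys()` normalizes differently.

```python
import hashlib
import re

# Sketch: key on a stable hash of the FULL normalized summary instead
# of a 50-char prefix, so same-prefix findings in one file stay distinct.
def finding_key(severity, path, summary):
    norm = re.sub(r"\S+:L?\d+(?:-L?\d+)?", "", summary).lower().strip()
    digest = hashlib.sha256(norm.encode()).hexdigest()
    return (severity, path, digest)

a = finding_key("P1", "foo.py", "x" * 50 + " different tail A")
b = finding_key("P1", "foo.py", "x" * 50 + " different tail B")
assert a != b  # these would collide under a 50-char prefix key
```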

Performance

No findings. The prior delta-budget double-count appears fixed in .claude/scripts/openai_review.py:L1472-L1491.

Maintainability

  • Severity P3. Impact: the displayed previous-review round is computed with string max() over R<n> identifiers, so once reruns reach double digits the prompt can label round 10 as round 9 in .claude/scripts/openai_review.py:L991-L995. Concrete fix: parse round numbers as integers before calling max().

Tech Debt

No separate findings. TODO.md does not track the review-state merge or prompt-table issues above, so they are not mitigated by existing deferred-work tracking in TODO.md:L44-L75.

Security

No findings. The new --include-files confinement rejects absolute paths and .. traversal via realpath plus repo-root prefix checks in .claude/scripts/openai_review.py:L1414-L1459.

Documentation/Tests

  • Severity P2. Impact: the added tests are broad, but they still miss the two concrete failure modes above: same-file same-prefix summary collisions in merge_findings() and |/newline corruption in the structured findings table. Both regressions would pass the current suite in tests/test_openai_review.py:L847-L1451. Concrete fix: add one merge regression where two findings differ only after character 50 in the same file, and one compile_prompt() regression where summary/location contain | and embedded newlines; ideally exercise one through the main() state-write path. I could not execute pytest here because pytest is not installed in this environment.

Path to Approval

  1. Make finding matching collision-resistant by keying on the full normalized summary, or a stable hash of it, before any truncated/fallback comparison.
  2. Escape or reserialize structured-findings rows so summary and location cannot break the previous-findings table.
  3. Add regressions for both cases in tests/test_openai_review.py, preferably including one end-to-end rerun that writes review-state.json.
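Item 2 above (escaping structured-findings rows) can be sketched as a small cell-escaping helper. This is an assumed approach; the actual fix in compile_prompt() may instead switch the block to bullets or JSON.

```python
# Sketch: escape Markdown-table metacharacters so a summary like
# "str | None" cannot split the previous-findings row into extra columns.
def escape_cell(text):
    return (text.replace("\\", "\\\\")
                .replace("|", "\\|")
                .replace("\n", " "))

row = f"| P2 | {escape_cell('str | None')} |"
assert row == "| P2 | str \\| None |"
```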

P1: Remove 50-char truncation from _finding_keys() summary fingerprint.
Uses full normalized summary so distinct findings with same prefix don't
collapse into one state entry. Normalization (lowercase + strip file:line
references) already handles wording stability.

P2: Escape | and newline chars in structured findings table cells to prevent
table corruption when summaries contain pipe chars (e.g., "str | None").

P2: Add 2 regression tests: long summary collision prevention and pipe
char escaping in findings table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 23, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ac93d46e8d7fe9360b15897b35745634e2ad6f70


Overall Assessment

✅ Looks good

Executive Summary

Methodology

  • No findings. I checked docs/methodology/REGISTRY.md:1, and this PR does not touch estimator code, weighting, variance/SE logic, identification checks, or defaults.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: the re-review header still computes the previous round with lexicographic string max(), so once the history reaches R10, the prompt can show Round 9 instead of Round 10, which weakens the structured re-review context in .claude/scripts/openai_review.py:991. Concrete fix: parse the R<n> prefix to integers before max() and format the header from the numeric result.
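The numeric-max fix suggested here is a one-liner; the sketch below shows the failure mode and the fix under the assumption that round identifiers are plain `R<n>` strings.

```python
# Lexicographic max() mislabels rounds once identifiers hit double digits.
rounds = ["R9", "R10", "R2"]
assert max(rounds) == "R9"  # string comparison: "R9" > "R10"

# Parsing the numeric suffix gives the intended result.
latest = max(rounds, key=lambda r: int(r[1:]))
assert latest == "R10"
```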

Tech Debt

  • No findings. TODO.md does not currently track the remaining minor issues, but they are P3-only and do not change the approval outcome; I checked the code-review debt section at TODO.md:44.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the Notes section still says “Re-review mode activates automatically when a previous review exists in .claude/reviews/local-review-latest.md,” but the new flow in Step 4/5 only reuses prior context when review-state.json validates and delta files are generated, so the user-facing note is stale in .claude/commands/ai-review-local.md:477. Concrete fix: update that note to reference validated review-state.json plus delta activation, not local-review-latest.md alone.
  • Severity P3. Impact: the new prompt-compilation coverage still does not include a double-digit round regression, which is why the remaining round-label bug above is not caught by tests in tests/test_openai_review.py:276. Concrete fix: add a compile_prompt() test with R9 and R10 findings and assert that the header renders Round 10.

I could not execute python3 -m pytest tests/test_openai_review.py -q here because pytest is not installed in this environment.
