Skip to content

Phase 4: Survey support for ImputationDiD, TwoStageDiD, CallawaySantAnna#233

Open
igerber wants to merge 6 commits intomainfrom
survey-roadmap
Open

Phase 4: Survey support for ImputationDiD, TwoStageDiD, CallawaySantAnna#233
igerber wants to merge 6 commits intomainfrom
survey-roadmap

Conversation

@igerber
Copy link
Owner

@igerber igerber commented Mar 22, 2026

Summary

  • Add weights parameter to solve_logit() — survey weights enter IRLS as w_survey * mu*(1-mu), enabling survey-weighted propensity score estimation
  • Add survey support to ImputationDiD: weighted iterative FE, survey-weighted ATT aggregation, weighted conservative variance (Theorem 3), survey df for inference
  • Add survey support to TwoStageDiD: weighted iterative FE, weighted Stage 2 OLS, weighted GMM sandwich variance with survey weights in both stages
  • Add survey support to CallawaySantAnna: survey-weighted regression, IPW (via weighted solve_logit()), and DR methods with explicit influence functions; survey-weighted WIF in aggregation; Cholesky cache bypassed under survey weights
  • Unblock TripleDifference IPW/DR with survey (weighted solve_logit() now available, removing Phase 3 deferral)
  • 41 new tests covering all estimators, weighted logit, scale invariance, and CS inference validation
  • Update Phase 3 tests: TripleDifference IPW/DR tests changed from "raises" to "works"
  • Update survey-roadmap.md with Phase 4 implementation status
  • Document CS per-cell SE deviation in REGISTRY.md (influence-function variance matching R's did, not full TSL)
  • Log pre-existing Hausman stale n_cl bug from PR EfficientDiD: cluster-robust SEs, last-cohort control, Hausman pretest, small cohort warning #230 in TODO.md

Methodology references (required if estimator / math changes)

  • ImputationDiD: Borusyak, Jaravel & Spiess (2024), "Revisiting Event-Study Designs", RES — Theorem 3 conservative variance with survey-weighted aggregation weights
  • TwoStageDiD: Gardner (2022) / Butts & Gardner (2022) — GMM sandwich with survey weights in cross-products
  • CallawaySantAnna: Callaway & Sant'Anna (2021), "Difference-in-Differences with multiple time periods", JoE — survey weights compose multiplicatively with IPW weights; influence-function-based per-cell SE matching R's did::att_gt() analytical path
  • Weighted logit: Matches R's svyglm(family=binomial) — working weights = survey_w × mu × (1−mu)
  • Intentional deviation: CS per-cell ATT(g,t) SEs use IF variance (not full TSL with strata/PSU/FPC). Design effects enter at aggregation via WIF and survey df. Documented in REGISTRY.md.

Validation

  • Tests added/updated: tests/test_survey_phase4.py (41 tests), tests/test_survey_phase3.py (2 tests updated)
  • Full test suite: 2284 passed, 0 failed
  • Smoke tests verify: uniform weights match unweighted, non-uniform weights change ATT, bootstrap+survey raises NotImplementedError, scale invariance, survey metadata in summary output

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

🤖 Generated with Claude Code

igerber and others added 2 commits March 22, 2026 15:40
…, CallawaySantAnna)

- Weighted solve_logit(): survey weights enter IRLS as w_survey * mu*(1-mu)
- ImputationDiD: weighted iterative FE, survey-weighted ATT aggregation,
  weighted conservative variance (Theorem 3), survey df for inference
- TwoStageDiD: weighted iterative FE, weighted Stage 2 OLS, weighted GMM
  sandwich variance with survey weights in both stages
- CallawaySantAnna: survey-weighted regression, IPW (via weighted solve_logit),
  and DR methods with explicit influence functions; survey-weighted WIF in
  aggregation; Cholesky cache bypassed under survey weights
- Unblock TripleDifference IPW/DR with survey (weighted solve_logit now available)
- 38 new tests in test_survey_phase4.py covering all estimators + scale invariance
- Update survey-roadmap.md, REGISTRY.md with Phase 4 status and deviation notes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e-existing Hausman bug, add inference tests

- Add REGISTRY.md deviation note: CS per-cell ATT(g,t) SEs use influence-function
  variance (matching R's did), not full TSL with strata/PSU/FPC
- Add TODO entries: CS per-cell TSL SE (Medium), Hausman stale n_cl from #230 (Medium)
- Add 3 CS survey inference validation tests (scale invariance, per-cell ATT change,
  survey df effect)
- Remove unused survey_weight_type in ImputationDiD

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link

Overall Assessment

⛔ Blocker

Executive Summary

  • The PR removes the survey guard for TripleDifference IPW/DR, but fit() still routes those methods through the unweighted wrappers, so non-uniform survey weights are silently ignored for both ATT and SE.
  • ImputationDiD documents survey-weighted Theorem 3 variance, but the variance code never receives survey weights and still uses the unweighted v_it formulas/projection matrices.
  • CallawaySantAnna’s new no-covariate survey SE formulas are not influence-function based as documented; they use first-power weighted variances while the stored IFs use weighted deviations.
  • The Methodology Registry still says TripleDifference IPW/DR survey support is deferred, while the PR updates tests/roadmap to claim it is supported.
  • The new DDD survey tests only check “does not raise” / finite outputs, which is why the weighting regression slipped through.

Methodology

  • Severity: P0. Impact: TripleDifference IPW/DR with survey_design still computes unweighted estimates and unweighted SEs. Only the regression path receives survey_weights / resolved_survey; the IPW and DR branches call wrappers that drop both arguments, even though the downstream decomposition code already has survey-aware hooks. That is a silent correctness bug: users can pass non-uniform survey weights and get the wrong numbers with no warning. Concrete fix: either restore the NotImplementedError for IPW/DR+survey, or thread survey_weights and resolved_survey through the IPW/DR call chain and add regression tests that verify non-uniform weights change ATT/SE while uniform weights match the unweighted result. References: diff_diff/triple_diff.py:L557-L570, diff_diff/triple_diff.py:L791-L800, diff_diff/triple_diff.py:L1021-L1047.

  • Severity: P1. Impact: ImputationDiD claims survey-weighted conservative variance, but the survey path only changes the treated aggregation weights passed into _compute_conservative_variance(). The actual Theorem 3 helper still builds v_untreated from unweighted untreated counts in the FE-only case and from unweighted normal equations A_0'A_0 in the covariate case. Since survey_weights never reach these helpers, the reported SE is not the weighted Theorem 3 variance described in the registry. Concrete fix: thread survey weights into the v_it construction itself, not just the treated aggregation weights; for the covariate case that means using the weighted projection system rather than A_0'A_0, and for the FE-only case the closed form needs weighted untreated totals rather than raw counts. References: diff_diff/imputation.py:L447-L473, diff_diff/imputation.py:L1022-L1100, diff_diff/imputation.py:L1196-L1278, docs/methodology/REGISTRY.md:L845-L845.

  • Severity: P1. Impact: CallawaySantAnna’s new no-covariate survey SEs are internally inconsistent with the influence functions the code stores and uses downstream. In the survey branches, the IF is w_norm * (y - mean_w(y)), but the SE is computed from sum(w_norm * (y - mean_w(y))^2) rather than sum((w_norm * (y - mean_w(y)))^2). That makes per-cell se, t_stat, p_value, and conf_int wrong whenever weights are non-uniform, in the common no-covariate regression/IPW/DR paths and the no-covariate fallback path. Concrete fix: derive SEs from the same IF object the method returns, or replace both with a single shared survey IF helper so the per-cell SE and stored IF cannot diverge again. References: diff_diff/staggered.py:L709-L724, diff_diff/staggered.py:L1614-L1629, diff_diff/staggered.py:L1778-L1792, diff_diff/staggered.py:L1994-L2003, docs/methodology/REGISTRY.md:L419-L420.

Code Quality

  • No additional code-quality findings beyond the methodology wiring and variance issues above.

Performance

  • No new performance blockers found in the changed paths.

Maintainability

  • No new untracked maintainability blockers found beyond the issues above.

Tech Debt

  • Severity: P3. Impact: The PR correctly tracks one real deferred limitation in TODO.md: full design-based per-cell CallawaySantAnna survey SEs remain deferred. That tracking is appropriate, but it does not mitigate the separate SE bugs above. Concrete fix: none required for the tracking itself. Reference: TODO.md:L55-L56.

Security

  • No findings.

Documentation/Tests

  • Severity: P1. Impact: The Methodology Registry still says TripleDifference IPW/DR survey support is deferred, while the PR updates tests and the survey roadmap to treat it as implemented. For this project, REGISTRY.md is the methodology source of truth, so that is an unmitigated documentation/methodology mismatch. Concrete fix: after fixing the actual IPW/DR survey wiring, update the registry; if the feature remains incomplete, restore the guard and keep the registry as-is. References: docs/methodology/REGISTRY.md:L1248-L1249, tests/test_survey_phase3.py:L499-L529, tests/test_survey_phase4.py:L675-L716.

  • Severity: P2. Impact: The new DDD survey tests only assert that IPW/DR “work” in the sense of not raising and returning finite outputs. They never assert that survey weights actually affect IPW/DR estimates or that uniform weights collapse to the unweighted result, so they miss the core regression. Concrete fix: add one non-uniform-weight test per method that compares weighted vs unweighted ATT/SE, plus one uniform-weight invariance test for IPW and DR. References: tests/test_survey_phase3.py:L499-L529, tests/test_survey_phase4.py:L675-L716.

Path to Approval

  1. Fix or re-disable TripleDifference IPW/DR survey support. The minimal safe change is to restore the NotImplementedError; the preferred change is to thread survey_weights and resolved_survey through IPW/DR and verify weighted ATT/SE numerically.
  2. Rework ImputationDiD survey SEs so the Theorem 3 v_it construction uses weighted Step 1 algebra, not just weighted treated aggregation weights.
  3. Fix CallawaySantAnna survey no-covariate per-cell SEs to use the same IF algebra that the code stores and aggregates.
  4. Update docs/methodology/REGISTRY.md only after the estimator behavior is actually correct, and strengthen the DDD survey tests so they assert weighted behavior rather than mere finiteness.

…, CS SE formula

- P0: Thread survey_weights through TripleDifference IPW and DR call chains
  (_ipw_estimation, _doubly_robust, _compute_did_rc_ipw, _compute_did_rc_dr).
  Survey weights now enter Riesz representers for weighted Hajek averages.
- P1: Fix CallawaySantAnna no-covariate survey SE to derive from sum(IF^2)
  instead of sum(w_norm * (y-mean)^2). All 4 locations now consistent with
  stored influence functions.
- P1: Update REGISTRY.md TripleDifference entry to reflect full survey support
  (was still marked as "IPW/DR deferred").
- P2: Add behavioral tests for TripleDiff IPW/DR survey: non-uniform weights
  change ATT, uniform weights match unweighted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: ab65708f0116716c097d6fc16c57fd97a8a40850


Overall Assessment

⚠️ Needs changes

This re-review clears two of the prior blockers: TripleDifference no-covariate IPW/DR survey wiring is now present, and the CallawaySantAnna per-cell survey SE deviation is now implemented as IF-based per-cell variance and explicitly mitigated in both REGISTRY.md and TODO.md. The highest remaining unmitigated severity is still P1: ImputationDiD survey SEs do not implement the documented weighted conservative variance, and I found two additional survey-path P1s in covariate-adjusted TripleDifference IPW/DR and TwoStageDiD’s always-treated edge case.

Executive Summary

  • ImputationDiD now survey-weights ATT aggregation, but its Theorem 3 untreated-side v_it algebra is still unweighted, so survey SEs do not match the documented method.
  • The previous TripleDifference blocker where survey weights were dropped entirely on the no-covariate IPW/DR path appears fixed, but covariate-adjusted survey IPW/DR still computes nuisance IF corrections with unweighted Hessian/score/OLS terms, so those survey SEs remain wrong.
  • TwoStageDiD resolves survey arrays before excluding always-treated units; on that documented edge case, the filtered fit can fail with stale-length survey arrays.
  • The prior CallawaySantAnna per-cell survey SE concern is no longer blocking: the code now derives those SEs from the stored IF, and the remaining full design-based per-cell TSL gap is documented and tracked.
  • The new survey tests are materially better for no-covariate paths, but they still miss DDD IPW/DR with covariates and TwoStageDiD with always-treated units under survey.

Methodology

The touched paths are all source-sensitive: Borusyak-Jaravel-Spiess place ImputationDiD uncertainty on the Theorem 3 conservative variance, Gardner/did2s treats two-stage inference as a generated-regressor/GMM problem, Callaway-Sant’Anna support OR/IPW/DR staggered DiD with covariates, and the current DDD source paper explicitly emphasizes covariate-adjusted RA/IPW/DR inference. (economics-static.uchicago.edu)

Code Quality

  • No additional code-quality findings beyond the estimator/inference issues above.

Performance

  • No new performance blockers in the changed survey paths.

Maintainability

  • No additional maintainability findings beyond the stale in-code documentation noted below.

Tech Debt

  • Severity: P3-informational. Impact: The newly added TODO.md entry for CallawaySantAnna’s remaining full per-cell design-based TSL limitation is appropriate tracking and does mitigate that deferrable gap under the project rules. Concrete fix: None required unless you want to implement full unit-level per-cell survey TSL now. Evidence: TODO.md:L55, docs/methodology/REGISTRY.md:L419-L420.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new survey tests still miss the two concrete regressions above. TripleDifference survey IPW/DR is only tested on no-covariate data, while the only survey+covariate DDD test covers reg only; likewise the new TwoStageDiD survey tests never hit the documented always-treated exclusion path. That is why the remaining covariate-adjusted DDD SE issue and the filtered-survey length mismatch are not caught. Concrete fix: Add survey tests for DDD IPW+covariates and DR+covariates (both uniform-weight invariance and non-uniform-weight sensitivity), plus a TwoStageDiD survey test with at least one always-treated unit. Evidence: tests/test_survey_phase3.py:L499-L603, tests/test_survey_phase4.py:L341-L461, tests/test_survey_phase4.py:L672-L804.

  • Severity: P2. Impact: TripleDifference.fit()’s docstring still says survey support is only available for estimation_method="reg", while the methodology registry and the new tests now treat survey support as implemented for reg, ipw, and dr. That stale in-code reference will mislead users and future reviewers even after the code is fixed. Concrete fix: Update the docstring once the actual support boundary is correct, or restore the runtime guard if IPW/DR survey support is intentionally partial. Evidence: diff_diff/triple_diff.py:L480-L484, docs/methodology/REGISTRY.md:L1243-L1248.

Static-review note: I could not execute the test suite in this environment because pytest is unavailable and the local Python environment is missing numpy.

Path to Approval

  1. Rework ImputationDiD survey conservative variance so survey weights enter the untreated v_it construction itself in both the FE-only and covariate branches, then add a regression test that checks SE behavior under non-uniform vs uniform survey weights.
  2. Fix TripleDifference survey IPW/DR with covariates by weighting the PS Hessian/score and DR OLS linear-representation terms consistently with the survey fits, then add survey+covariate IPW/DR tests.
  3. Move TwoStageDiD survey resolution after the always-treated exclusion or subset/re-resolve all survey arrays immediately after filtering, then add a survey-weighted always-treated regression test.
  4. Update TripleDifference.fit()’s survey-support docstring once the actual support boundary is correct.

… always-treated survey

- ImputationDiD: use survey-weighted untreated sums (not raw counts) in Theorem 3
  v_it FE-only path, and weighted normal equations (A_0'W_0 A_0) in covariate path
- TripleDifference: weight PS Hessian, score correction, and DR OLS linear
  representations with survey weights for covariate-adjusted IPW/DR IF corrections;
  update fit() docstring to reflect full survey support
- TwoStageDiD: subset survey_weights and resolved_survey arrays after always-treated
  unit exclusion to prevent length mismatch
- Add tests: TwoStageDiD always-treated + survey, DDD covariate IPW/DR + survey

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 22, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: ae6b278a4512079d7564f2f73da320c9d91f19b2


Overall Assessment

⚠️ Needs changes

The previous ImputationDiD variance-weighting gap and TwoStageDiD always-treated survey-array bug look fixed in this diff, but there are still unmitigated P1 issues in the new survey inference math for CallawaySantAnna and covariate-adjusted TripleDifference.

Executive Summary

  • ImputationDiD now threads survey weights into untreated-side Theorem 3 denominators / projections, so the prior imputation variance blocker appears resolved.
  • TwoStageDiD now subsets survey arrays after excluding always-treated units, so the prior always-treated edge-case blocker appears resolved.
  • CallawaySantAnna survey aggregation SEs are still wrong: the new survey WIF path mis-scales the weighted group-share influence function, so overall / event-study / group SEs do not implement the documented survey-weighted aggregation.
  • TripleDifference survey IPW/DR no longer hard-fails, but the covariate-adjusted survey IF corrections still apply survey weights twice in several moment maps, so those SEs remain incorrect.
  • CallawaySantAnna survey covariate paths (reg, ipw, dr) do not carry the weighted nuisance-estimation IF terms implied by the registry claim that survey SEs follow the analytical IF path, so per-cell and aggregated SEs are wrong when covariates are used.
  • The new tests improve plumbing coverage, but they still do not pin down the new SE algebra: there are no deterministic WIF checks for CS, no CS survey+covariate tests, and no SE-focused covariate survey tests for TripleDifference.

Methodology

  • Severity: P1 [Newly identified]. Impact: CallawaySantAnna’s new survey-weighted WIF is algebraically mis-scaled, so survey overall / dynamic / group aggregation SEs are wrong for all methods. The code replaces raw cohort indicators with sw_i / sum(sw) before subtracting pg, then divides by n_units again; that drops the unit weight from the negative pg term and applies the normalization on the wrong scale. This is not the documented per-cell TSL deviation in TODO.md; it is a separate aggregation-level SE error. Concrete fix: derive the WIF from the weighted share estimator p_g = Σ s_i 1{G_i=g} / Σ s_i, keep the s_i factor on both terms of the cohort-share IF, and apply the final normalization once. Add a small deterministic regression test for simple, event-study, and group aggregation SEs. Evidence: diff_diff/staggered_aggregation.py:L265-L390, diff_diff/staggered_aggregation.py:L397-L451, docs/methodology/REGISTRY.md:L419-L420, TODO.md:L55-L56

  • Severity: P1 (unresolved prior finding). Impact: TripleDifference survey IPW/DR with covariates still has incorrect SEs. The PR fixed the obvious missing-weight omissions, but several correction moments now weight the same terms twice: the survey branch first multiplies riesz_* by weights, then computes M2_*, M3_*, and mom_* with np.average(..., weights=weights). That distorts the combined IF that is later passed into TSL. Concrete fix: in the covariate IPW/DR survey branches, make survey weights enter each estimating equation exactly once; either keep weights inside riesz_* and use plain means afterward, or keep riesz_* unweighted and use a single weighted empirical average consistently. Add SE-focused covariate survey tests, including weight-rescaling invariance. Evidence: diff_diff/triple_diff.py:L1288-L1353, diff_diff/triple_diff.py:L1481-L1717, diff_diff/triple_diff.py:L1078-L1085, docs/methodology/REGISTRY.md:L1195-L1248

  • Severity: P1 [Newly identified]. Impact: CallawaySantAnna’s survey covariate branches do not implement the weighted nuisance-estimation influence terms promised by the new registry note. In _outcome_regression, _ipw_estimation, and _doubly_robust, the survey inf_func collapses to simple weighted residual pieces and omits the regression / propensity-score estimation corrections that drive the analytical IF path. Because the same inf_func is reused for aggregation, both per-cell and aggregated SEs are wrong whenever covariates are supplied. Concrete fix: port the full weighted analytical IF into the survey reg / ipw / dr branches, or reject covariates+survey at runtime and narrow the docs until that IF is implemented. Evidence: diff_diff/staggered.py:L1572-L1601, diff_diff/staggered.py:L1682-L1754, diff_diff/staggered.py:L1859-L1976, docs/methodology/REGISTRY.md:L419-L420

Code Quality

  • No additional code-quality findings beyond the inference-math issues above.

Performance

  • No new performance blockers in the changed survey paths.

Maintainability

  • No additional maintainability findings beyond the support-boundary ambiguity noted in Documentation/Tests.

Tech Debt

  • Severity: P3. Impact: the new TODO.md entries properly track the documented per-cell CS survey-TSL limitation and the pre-existing EfficientDiD stale-n_cl issue, so those tracked items are not blockers under this review rubric. Concrete fix: none required for this PR unless you choose to implement those deferred items now. Evidence: TODO.md:L55-L56

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests mainly check “works / finite / ATT changes” behavior, but they do not verify the newly added SE algebra that is currently broken. There are no deterministic CS WIF tests, no CallawaySantAnna survey+covariate tests, and no TripleDifference covariate survey SE regression tests. The roadmap also now advertises TripleDifference and CallawaySantAnna survey support as “Full,” which is overstated until the SE issues above are fixed or those combinations are gated/documented as partial. Concrete fix: add targeted SE regression tests for the flagged cases; if the math is deferred, revert the new “Full” support claims and restore runtime guards for the unsupported combinations. Evidence: tests/test_survey_phase4.py:L495-L980, docs/survey-roadmap.md:L17-L25, docs/survey-roadmap.md:L43-L53, diff_diff/triple_diff.py:L480-L496

Path to Approval

  1. Fix CallawaySantAnna’s survey WIF so the cohort-share influence function is derived from the weighted share estimator and normalized exactly once, then add deterministic SE regression tests for simple, event-study, and group aggregation.
  2. Fix TripleDifference covariate survey IPW/DR moment maps so survey weights enter each IF correction term exactly once, then add covariate survey SE tests that check weight-rescaling invariance.
  3. Implement the full weighted nuisance-estimation IF in CallawaySantAnna survey reg / ipw / dr covariate branches, or explicitly reject covariates+survey for those methods until that is done.
  4. If any unsupported survey combinations remain after the code fixes, narrow the new “Full” support claims in the registry / roadmap and gate those combinations at runtime instead of documenting them as complete.

Static-review note: I could not run the targeted tests in this environment because pytest is not installed.

…riate nuisance IF

- TripleDifference: remove double-weighting in IPW/DR moment corrections —
  since Riesz representers already incorporate survey weights, moment means
  use np.mean() not np.average(weights=). Removed _wmean_ax0 helper.
- CallawaySantAnna WIF: apply s_i symmetrically to both indicator and pg terms
  in the weighted share estimator IF. Normalize by total_weight (sum of survey
  weights) instead of n_units.
- CallawaySantAnna outcome regression covariate IF: add weighted regression
  nuisance IF correction (asymptotic linear representation of beta from WLS,
  projected onto weighted treated covariate mean). IPW and DR IFs unchanged
  (IPW matches unweighted structure; DR is self-correcting per Theorem 3.1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 23, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: e754b04cc72778b32a737fd6dea0c07bc67fa3c4


Overall Assessment

⚠️ Needs changes

Unmitigated P1s remain in the new survey inference paths. The earlier ImputationDiD variance-weighting gap and the obvious TripleDifference survey double-weighting problem look resolved in this diff, but CallawaySantAnna still has wrong survey SE algebra, and TwoStageDiD adds a new always-treated survey d.f. edge-case bug.

Executive Summary

  • The previous ImputationDiD blocker looks fixed; I did not find a new P1 in the changed imputation variance path.
  • The previous TripleDifference survey IPW/DR blocker also looks fixed; I did not find an unmitigated P1 in the changed weighted IF code there.
  • CallawaySantAnna survey aggregation SEs are still wrong: the new survey WIF branch does not implement the official did weight-IF formula and rescales the term incorrectly.
  • CallawaySantAnna survey + covariates is still incomplete for ipw/dr: the new survey branches omit the nuisance-estimation IF corrections needed for analytical SEs.
  • TwoStageDiD now subsets survey arrays after dropping always-treated units, but it leaves n_psu / n_strata stale, so survey d.f. and reported metadata can be wrong on that edge case.
  • The new tests are mostly smoke / plumbing checks and do not pin down the broken Callaway survey SE algebra; the roadmap’s Full (analytical) claim is therefore overstated.

Methodology

  • Severity: P1 (unresolved prior finding). Impact: CallawaySantAnna survey aggregation SEs are still incorrect. In diff_diff/staggered_aggregation.py:L343-L396, the survey WIF branch uses unit_sw * pg in the negative term and divides by total_weight inside indicator_diff, then divides by total_weight again at psi_wif = wif_contrib / total_weight. The official did::wif() uses weights.ind * I(G=g) - pg and adds that WIF directly to the aggregate IF before a single getSE() scaling step, so this branch is both algebraically different and over-normalized. Concrete fix: derive the survey WIF from unit_sw * I(G=g) - pg, remove the inner /total_weight, keep only the final normalization, and add deterministic SE regression tests for simple, event-study, and group aggregation. Evidence: diff_diff/staggered_aggregation.py:L343-L396, diff_diff/staggered_aggregation.py:L403-L455, docs/methodology/REGISTRY.md:L419-L420. (rdrr.io)

  • Severity: P1 (unresolved prior finding, narrowed). Impact: CallawaySantAnna survey + covariates is still missing required analytical IF terms for estimation_method="ipw" and "dr". The new survey IPW branch in diff_diff/staggered.py:L1707-L1779 stops at the leading treated/control terms, but the official panel IPW IF includes an additional propensity-score estimation correction (inf.cont.2). The new survey DR branch in diff_diff/staggered.py:L1884-L2001 similarly reduces to psi_treated/psi_control and omits the treated OR correction plus the control PS/OR corrections. That makes per-cell and aggregated survey SEs wrong whenever covariates are supplied. Concrete fix: port the DRDID panel nuisance-estimation IF pieces into these survey branches, or reject survey_design with covariates for ipw/dr until that is implemented; then add reference-based SE tests on a fixed dataset. Evidence: diff_diff/staggered.py:L1707-L1779, diff_diff/staggered.py:L1884-L2001, docs/methodology/REGISTRY.md:L419-L420. (rdrr.io)

  • Severity: P1 [Newly identified]. Impact: TwoStageDiD can emit wrong survey d.f., p-values, confidence intervals, and survey metadata after excluding always-treated units. The new filtering path slices the resolved survey arrays with replace(...) in diff_diff/two_stage.py:L290-L314, but ResolvedSurveyDesign.df_survey is computed from cached n_psu / n_strata in diff_diff/survey.py:L295-L304, and those counts are then consumed at diff_diff/two_stage.py:L434-L435. If dropping always-treated units removes a PSU or an entire stratum, inference is silently based on stale counts. Concrete fix: after filtering, re-resolve the survey design on the filtered df or recompute/re-factorize PSU and stratum counts before using df_survey or compute_survey_metadata(), and add a regression test where always-treated units occupy their own PSU/stratum.

Code Quality

  • No additional code-quality findings beyond the inference bugs above.

Performance

  • No new performance blockers identified in the changed survey paths.

Maintainability

  • Severity: P2. Impact: the new survey feature is not exposed consistently through the public convenience APIs. diff_diff/imputation.py:L1888-L1945 and diff_diff/two_stage.py:L1589-L1645 still construct the estimator from **kwargs and never accept/pass survey_design into fit(), so users of imputation_did() / two_stage_did() cannot reach the new capability. Concrete fix: add survey_design=None to both wrappers, thread it through, and add wrapper-parity tests.

Tech Debt

  • Severity: P3. Impact: the new TODO entry for Callaway per-cell survey TSL is properly tracked and is not a blocker under this rubric; same for the unrelated EfficientDiD stale-n_cl item. Concrete fix: none required for this PR. Note that TODO.md:L55 does not cover the unmitigated aggregation WIF bug or the covariate-survey IF omissions above.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests do not pin down the broken CallawaySantAnna survey IF math. The full-design tests only assert metadata and df-driven p-value differences in tests/test_survey_phase4.py:L573-L602 and tests/test_survey_phase4.py:L955-L980; they do not include a deterministic/reference survey aggregation SE check, and they do not cover Callaway survey + covariate ipw/dr SEs. At the same time, the roadmap labels CallawaySantAnna as Full (analytical) in docs/survey-roadmap.md:L47-L49. Concrete fix: add reference-based Callaway survey SE tests for simple/event-study/group aggregation plus covariate-adjusted ipw/dr, or narrow/gate the documented support surface until the P1 math issues are fixed.

Review note: pytest is not installed in this environment, so I could not execute the added tests; this is a static review only.

Path to Approval

  1. Fix the CallawaySantAnna survey WIF algebra in diff_diff/staggered_aggregation.py to match did::wif() and add a deterministic weighted-SE regression test for simple, event-study, and group aggregation.
  2. Implement the missing DRDID-style nuisance-estimation IF corrections in the Callaway survey ipw and dr covariate branches, or explicitly reject survey_design + covariates + {ipw, dr} at runtime until that is done.
  3. Rebuild/re-resolve the survey design after always-treated filtering in TwoStageDiD, then add a test where filtered units remove an entire PSU or stratum and assert the updated df_survey/p-values.

…ge n_psu

- CallawaySantAnna WIF: remove inner /total_weight from indicator_diff — the
  final psi_wif/total_weight handles normalization once, matching R's did::wif()
- CallawaySantAnna IPW covariate: add propensity score nuisance IF correction
  (survey-weighted Hessian, score, M2 gradient) so per-cell and aggregated SEs
  account for PS estimation uncertainty
- TwoStageDiD: recompute n_psu/n_strata after always-treated filtering via
  np.unique() on subsetted arrays, then recompute survey_metadata

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Mar 23, 2026

/ai-review

@github-actions
Copy link

🔁 AI review rerun (requested by @igerber)

Head SHA: 98d80c60f2fc5c4d9cb6220e28c092f03dcdc646


Overall Assessment

⚠️ Needs changes

This re-review leaves two unmitigated P1s in the new survey CallawaySantAnna analytical inference paths. The prior TwoStageDiD always-treated survey d.f. issue appears fixed in diff_diff/two_stage.py:290, and I did not find a new P1+ elsewhere in the changed survey code.

Executive Summary

Methodology

  • Severity: P1 (unresolved prior finding). Impact: CallawaySantAnna survey aggregation SEs remain methodologically incorrect for aggregate="simple" and aggregate="event_study". The survey WIF branch in diff_diff/staggered_aggregation.py:343 builds indicator_diff as s_i * 1{G_i=g_k} - s_i * pg_k via weighted_pg_term, but the official did WIF uses weights.ind * 1{G_i=g_k} - pg_k. The registry note at docs/methodology/REGISTRY.md:419 documents survey-weighted group sizes, not this algebraic deviation. The affected aggregation call sites are diff_diff/staggered_aggregation.py:130 and diff_diff/staggered_aggregation.py:548. Concrete fix: reimplement the survey WIF exactly as in did::wif()/aggte and add deterministic regression tests for survey simple and event_study SEs.
  • Severity: P1 (unresolved prior finding). Impact: CallawaySantAnna survey + covariates analytical SEs remain incorrect for estimation_method="ipw" and "dr". In the survey IPW branch diff_diff/staggered.py:1759, the new PS correction uses a normalized control-only M2 term that does not match the DRDID panel IPW influence function. In the survey DR branch diff_diff/staggered.py:2014, the influence function stops at psi_treated/psi_control and omits the nuisance-estimation corrections from the DRDID panel formula. The note at docs/methodology/REGISTRY.md:420 only documents the per-cell TSL-vs-IF choice, so it does not mitigate this mismatch. Concrete fix: port the exact DRDID panel nuisance-estimation IF pieces into the survey ipw and dr covariate paths, or raise NotImplementedError for survey_design + covariates + {"ipw","dr"} until that implementation is complete; then add reference-based per-cell and aggregated SE tests.

Code Quality

  • No additional code-quality findings beyond the inference bugs above.

Performance

  • No new performance blockers identified in the changed survey paths.

Maintainability

  • Severity: P2 (unresolved prior finding). Impact: the new survey capability is still unavailable through the imputation_did() and two_stage_did() convenience APIs, because both wrappers only forward constructor kwargs and never pass survey_design into fit() at diff_diff/imputation.py:1888 and diff_diff/two_stage.py:1608. Concrete fix: add explicit survey_design=None parameters to both wrappers, thread them through to fit(), and add wrapper-parity tests.

Tech Debt

  • Severity: P3. Impact: the per-cell CallawaySantAnna survey TSL limitation is properly documented in docs/methodology/REGISTRY.md:420 and tracked in TODO.md:55, so it is not a blocker under this rubric. This tracking does not mitigate the P1 aggregation WIF or survey+covariates IF defects above. Concrete fix: none required for approval; keep the registry/TODO note synchronized if the implementation changes.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Fix the CallawaySantAnna survey WIF algebra in diff_diff/staggered_aggregation.py to match did::wif() exactly, then add deterministic regression tests for survey aggregate="simple" and aggregate="event_study" SEs.
  2. Implement the full DRDID panel nuisance-estimation influence-function corrections for survey CallawaySantAnna with covariates in the ipw and dr paths, or explicitly block survey_design + covariates + {"ipw","dr"} with NotImplementedError until that work is done.
  3. After those fixes, add reference-based tests and/or narrow docs/survey-roadmap.md:49 so the documented support surface matches the validated methodology.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant