
feat: survey-aware power analysis (SurveyPowerConfig + deff)#292

Merged
igerber merged 12 commits into main from survey-aware-power
Apr 12, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 12, 2026

Summary

  • Add SurveyPowerConfig dataclass for simulation-based survey power analysis
  • Add survey_config parameter to simulate_power, simulate_mde, and simulate_sample_size: swaps the DGP to generate_survey_did_data and injects a SurveyDesign into the estimator's fit()
  • Add deff parameter to closed-form PowerAnalysis methods and convenience functions (compute_power, compute_mde, compute_sample_size)
  • 9 estimators supported (DiD, TWFE, MultiPeriod, CS, SA, Imputation, TwoStage, Stacked, Efficient); 3 blocked with a clear error (TROP, SyntheticDiD, TripleDifference)
  • DGP truth (Kish DEFF, realized ICC) accumulated across simulations and reported in results
  • Document deff and survey_config methodology in REGISTRY.md PowerAnalysis section

Methodology references (required if estimator / math changes)

  • Method name(s): Survey-aware power analysis, design effect (DEFF) variance inflation
  • Paper / source link(s): Kish, L. (1965). Survey Sampling. Wiley. (DEFF = 1 + CV(w)^2); Bloom (1995) MDE formula extended with multiplicative deff
  • Any intentional deviations from the source (and why): deff is a simple multiplicative inflation (not decomposed into clustering/weighting/stratification components) - the simulation path handles the full complexity, closed-form is deliberately approximate
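The Kish design effect referenced above has a direct computational form. A minimal sketch in plain Python (an illustration of the cited formula, not the library's implementation) of DEFF = 1 + CV(w)^2 and its multiplicative role, where the closed-form path inflates variance, and hence required sample size, by deff:

```python
def kish_deff(weights):
    """Kish (1965) design effect from sampling weights:
    DEFF = 1 + CV(w)^2, equivalently n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return n * s2 / (s * s)

# Equal weights give DEFF = 1 (no inflation); unequal weights inflate
# the effective variance, so required n scales by the same factor.
print(kish_deff([1.0, 1.0, 1.0, 1.0]))  # → 1.0
print(kish_deff([1.0, 2.0, 3.0, 4.0]))  # 4 * 30 / 100 = 1.2
```

With deff = 1.2, a design that needs n units under simple random sampling needs roughly 1.2 * n under the weighted design, which is exactly the multiplicative Bloom-style extension described above.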

Validation

  • Tests added/updated: tests/test_power.py - 32 new tests in TestSurveyPower class (simulation smoke tests for all 9 estimators, rejection tests for 3 unsupported, metadata/DEFF verification, closed-form deff correctness, validation edge cases)
  • All 153 power tests pass with 0 regressions
  • Backtest / simulation / notebook evidence (if applicable): N/A

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 2 commits April 12, 2026 12:22
Connect survey infrastructure to power module:
- SurveyPowerConfig dataclass for simulation-based survey power
- survey_config param on simulate_power/mde/sample_size swaps DGP
  to generate_survey_did_data and injects SurveyDesign into fit()
- deff param on closed-form PowerAnalysis methods and convenience functions
- 9 estimators supported, 3 blocked with clear error (factor model DGPs)
- DGP truth (Kish DEFF, realized ICC) reported in simulation results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sign

- Add REGISTRY.md notes for analytical deff parameter (variance/sample-size
  inflation formulas, deff vs rho distinction) and survey_config simulation
  path (supported estimators, mutual exclusivity, protected keys)
- Cache SurveyDesign in SurveyPowerConfig._build_survey_design() to avoid
  rebuilding per simulation iteration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⚠️ Needs changes

I found three unmitigated P1 issues in the new survey-aware power path: two methodology/adapter mismatches (DifferenceInDifferences, MultiPeriodDiD) and one edge-case correctness bug in simulate_sample_size()’s new survey floor logic.

Executive Summary

  • DifferenceInDifferences survey power is wired to the wrong treatment variable: the survey DGP’s post-only treated flag is being passed where the estimator expects a treated-group indicator, which makes the DiD design matrix rank-deficient.
  • MultiPeriodDiD has the same core wiring problem, plus the adapter omits unit="unit", so the estimator’s own time-varying-treatment warning path is skipped.
  • simulate_sample_size() does not fully enforce the new survey_config.min_viable_n floor in the auto-bracketing branch, so it can return a sample size below the documented minimum.
  • The closed-form deff change itself looks acceptable: the approximation is explicitly documented in the registry, so I do not count that as a defect.
  • I could not execute the new tests in this environment because the local runner is missing numpy; the review below is from static inspection.

Methodology

  • P1 Impact: Survey power for DifferenceInDifferences is not estimating the intended 2x2 DiD parameter. The survey DGP only emits a post-treatment indicator treated = 1{g_i>0, t>=g_i}, but _survey_basic_fit_kwargs() passes that as the treated-group indicator, and DifferenceInDifferences.fit() then constructs treated:post again. Under the single-cohort survey DGP, treated and treated:post are identical, so the ATT column is rank-deficient and the fixed att_idx=3 no longer reliably corresponds to the DiD estimand. References: diff_diff/power.py:L363-L376, diff_diff/prep_dgp.py:L1568-L1593, diff_diff/estimators.py:L168-L171, diff_diff/estimators.py:L305-L333. Concrete fix: derive an ever_treated = (first_treat > 0).astype(int) column in the survey branch and pass that to DifferenceInDifferences instead of the post-only treated flag.

  • P1 Impact: Survey power for MultiPeriodDiD is also wired to the wrong treatment semantics. _survey_multiperiod_fit_kwargs() passes the time-varying treated column even though MultiPeriodDiD.fit() explicitly requires a time-invariant ever-treated indicator; it also omits unit="unit", so the estimator’s own time-varying-treatment warning path never runs. That leaves pre-period interactions unidentified and post-period terms collinear with the treatment main effect, so the resulting power/MDE/sample-size outputs are not methodologically valid. References: diff_diff/power.py:L396-L410, diff_diff/prep_dgp.py:L1568-L1593, diff_diff/estimators.py:L883-L887, diff_diff/estimators.py:L944-L998. Concrete fix: derive and pass ever_treated, and also pass unit="unit" so invalid treatment timing is validated in the adapter path.

  • P3 Impact: No action required on the closed-form deff approximation itself. The PR documents it in the methodology registry as an intentional approximation/deviation, and Bloom’s original MDE discussion explicitly notes that the base equations do not account for multisite design effects. References: docs/methodology/REGISTRY.md:L2257-L2258. (journals.sagepub.com)
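The ever_treated fix proposed in the two P1 findings above can be illustrated with a toy sketch (function names are hypothetical stand-ins, not diff_diff API). It shows why the DGP's post-only flag collapses into the interaction term under a single cohort, while a time-invariant group indicator does not:

```python
# Post-treatment flag as the survey DGP is described to emit it
# (assumed shape): 1{g_i > 0, t >= g_i}.
def post_only_treated(first_treat, t):
    return int(first_treat > 0 and t >= first_treat)

# Time-invariant group indicator the 2x2 DiD estimator expects.
def ever_treated(first_treat):
    return int(first_treat > 0)

# Single-cohort example: units with first_treat in {0, 3}, periods 1..4.
rows = [(g, t) for g in (0, 3) for t in range(1, 5)]
post_flags = [post_only_treated(g, t) for g, t in rows]
group_flags = [ever_treated(g) for g, _ in rows]

# The DGP flag is nonzero only in post periods, so it equals the
# treated:post interaction and the design matrix goes rank-deficient;
# ever_treated flags the unit in every period, keeping the treatment
# main effect distinct from the interaction.
print(post_flags)   # [0, 0, 0, 0, 0, 0, 1, 1]
print(group_flags)  # [0, 0, 0, 0, 1, 1, 1, 1]
```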

Code Quality

  • P1 Impact: simulate_sample_size() only partially applies the new survey minimum-N floor. Although it computes abs_min = max(abs_min, survey_config.min_viable_n), the no-n_range branch still starts from lo = min_n; if that first probe already achieves target power, the function can early-return required_n < survey_config.min_viable_n. That contradicts both the new min_viable_n contract and the new registry note. References: diff_diff/power.py:L163-L166, diff_diff/power.py:L2756-L2815, docs/methodology/REGISTRY.md:L2258-L2258. Concrete fix: initialize the auto-bracketing lower bound from abs_min when survey_config is present, and clamp all early-return paths to abs_min.
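The clamping fix suggested above might look like the following sketch of a doubling-then-bisection search that never probes or returns below the survey floor. All names are hypothetical; power_at_n stands in for one simulation-based power evaluation and is assumed monotone here:

```python
import math

def bracket_sample_size(power_at_n, target_power, min_n, min_viable_n):
    """Smallest integer n whose power reaches target_power, with every
    probe and every return path clamped to the survey floor."""
    abs_min = max(min_n, min_viable_n)   # clamp the starting point
    lo = abs_min
    if power_at_n(lo) >= target_power:
        return lo                        # early return already >= floor
    hi = max(2 * lo, abs_min)            # seed hi at or above the floor
    while power_at_n(hi) < target_power:
        lo, hi = hi, 2 * hi
    while hi - lo > 1:                   # integer bisection
        mid = (lo + hi) // 2
        if power_at_n(mid) >= target_power:
            hi = mid
        else:
            lo = mid
    return max(hi, abs_min)

# Toy monotone power curve: power(n) = 1 - exp(-n / 500).
power = lambda n: 1 - math.exp(-n / 500)
print(bracket_sample_size(power, 0.8, min_n=30, min_viable_n=200))  # → 805
```

The key difference from the flawed branch described above: lo starts at abs_min rather than min_n, so an immediately-satisfied first probe can never fall below the documented minimum.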

Performance

  • No findings.

Maintainability

  • No findings requiring action.

Tech Debt

  • No existing TODO.md entry appears to track the three P1 issues above, so they are not mitigated by deferred-work tracking.

Security

  • No findings.

Documentation/Tests

  • P2 Impact: The new tests are mostly smoke tests and do not protect the broken adapter logic. test_survey_simulate_power_basic_did and test_survey_simulate_power_multiperiod only assert 0 <= power <= 1, which still passes for rank-deficient or NaN-driven outputs, and the new sample-size floor test only exercises a low-effect branch, not the large-effect early-return branch. References: tests/test_power.py:L2155-L2189, tests/test_power.py:L2332-L2349. Concrete fix: add regression tests that validate finite/non-rank-deficient survey outputs for DifferenceInDifferences and MultiPeriodDiD, and add a large-effect simulate_sample_size() test that would have returned below min_viable_n before the fix.

Path to Approval

  1. Fix the survey DifferenceInDifferences adapter to use an ever_treated indicator instead of the post-only treated flag, and add a regression test that asserts finite ATT/SE/p-value without rank-deficiency warnings.
  2. Fix or temporarily remove MultiPeriodDiD from the supported survey-power list. If kept, pass ever_treated plus unit="unit" and add a test that validates non-NaN survey power on that path.
  3. Apply the survey min_viable_n floor to all simulate_sample_size() branches, especially the no-n_range early-return path, and add a large-effect regression test that proves required_n >= survey_config.min_viable_n even when the first lower-bound probe already achieves target power.

…tronger tests

- Fix P1: basic/TWFE/MultiPeriod survey adapters now derive `ever_treated`
  (time-invariant group indicator) instead of using the DGP's post-only
  `treated` column, which caused rank-deficient design matrices
- Fix P1: MultiPeriodDiD adapter now passes `unit="unit"` for proper
  time-varying-treatment validation
- Fix P1: simulate_sample_size auto-bracketing starts from max(min_n, abs_min)
  and clamps early-return to abs_min, enforcing min_viable_n contract
- Fix P2: strengthen tests to assert finite mean_estimate/mean_se (not just
  0 <= power <= 1), add large-effect floor regression test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 40acf5b13b9a28080981e6e7cda16c458b6e7ce3


Overall Assessment

Blocker

Executive Summary

  • Re-review: the prior survey-adapter issues for DifferenceInDifferences and MultiPeriodDiD appear fixed; the new path now derives ever_treated and passes unit where needed.
  • The closed-form deff change itself looks methodologically acceptable because the multiplicative approximation is explicitly documented in the Methodology Registry.
  • simulate_power() still treats the scalar input treatment_effect as the simulation truth even when the survey DGP explicitly produces a different population ATT under heterogeneous treatment effects. That makes the reported effect scale, bias/RMSE/coverage, and the interpretation of power/MDE/sample-size wrong for supported survey configs.
  • The earlier min_viable_n floor issue is only partially fixed: one simulate_sample_size() auto-bracketing branch still seeds hi below the survey floor for large survey designs.
  • The new tests cover the earlier adapter fixes and an 80-unit survey floor, but they do not cover heterogeneous survey truth handling or the min_viable_n > 100 branch that still violates the documented floor.
  • I could not execute the tests locally because this environment does not have pytest or numpy.

Methodology

  • Severity P0. Impact: the new survey-aware simulation methods are not aligned with the DGP’s own truth under heterogeneous effects. generate_survey_did_data() documents that heterogeneous_te_by_strata=True creates a gap between the base treatment_effect and the population ATT, and it emits that truth as df.attrs["dgp_truth"]["population_att"]; the wrapper only records DEFF/ICC, still checks CI coverage against the scalar input effect, and stores true_effect=primary_effect, from which bias and rmse are derived. Because simulate_mde() and simulate_sample_size() delegate to simulate_power(), they also end up targeting the wrong effect scale under supported heterogeneous survey configs. References: diff_diff/prep_dgp.py:L1218-L1219, diff_diff/prep_dgp.py:L1279-L1283, diff_diff/prep_dgp.py:L1623-L1630, diff_diff/power.py:L2033-L2067, diff_diff/power.py:L2140-L2202, diff_diff/power.py:L1017-L1020, diff_diff/power.py:L2507-L2521, diff_diff/power.py:L2743-L2755. Concrete fix: either calibrate the heterogeneous survey DGP so the realized population ATT matches the user-facing treatment_effect, or reject heterogeneous-effect survey configs in these wrappers until that calibration exists; in either case, propagate per-simulation population_att into truth/coverage/bias calculations instead of using the scalar input effect.
  • Severity P3. Impact: no action required on the closed-form deff approximation itself. The Registry explicitly documents the multiplicative variance/sample-size inflation and its distinction from rho, so I do not count that as a methodology defect. Reference: docs/methodology/REGISTRY.md:L2257-L2258. Concrete fix: none.
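Propagating per-simulation population_att into the diagnostics, as the P0 fix above suggests, could be sketched like this. The names and tuple shape are assumptions for illustration, not the actual simulate_power internals; the point is that truth is taken per draw, falling back to the scalar input only when the DGP reports none:

```python
def summarize_truth(sim_results, fallback_effect):
    """Score estimates against per-simulation truth.
    Each entry: (estimate, ci_lo, ci_hi, population_att or None)."""
    biases, covered = [], []
    for est, lo, hi, pop_att in sim_results:
        truth = pop_att if pop_att is not None else fallback_effect
        biases.append(est - truth)
        covered.append(lo <= truth <= hi)
    n = len(sim_results)
    return {"bias": sum(biases) / n, "coverage": sum(covered) / n}

sims = [
    (2.1, 1.0, 3.2, 2.0),   # heterogeneous draw: truth is realized ATT
    (1.8, 0.7, 2.9, 2.0),
    (2.5, 1.4, 3.6, None),  # homogeneous draw: fall back to input effect
]
print(summarize_truth(sims, fallback_effect=2.5))
```

Scoring the first two draws against the scalar 2.5 instead of their realized ATT of 2.0 would overstate bias and misstate coverage, which is the failure mode the finding describes.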

Code Quality

  • Severity P1. Impact: the new survey min_viable_n floor is still incomplete in simulate_sample_size(). lo is raised to abs_min, but when power_lo < power the code still seeds hi = max(100, 2 * min_n) instead of respecting abs_min/lo. For survey designs with survey_config.min_viable_n > 100, that branch probes below the documented survey floor and can still return required_n < min_viable_n if a noisy power estimate at the smaller n crosses the target. That directly contradicts the new Registry note that the survey search raises the bisection floor. References: diff_diff/power.py:L2767-L2855, docs/methodology/REGISTRY.md:L2258-L2258. Concrete fix: seed hi from at least abs_min/lo (for example max(2 * lo, abs_min)), never call _power_at_n() with n < abs_min, and clamp the final best_n before returning.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3. Impact: I found no matching entry in the repo’s deferred-review table for the two issues above, so they are unmitigated under the project’s review policy. Reference: TODO.md:L51-L94. Concrete fix: resolve them in this PR; tracking them in TODO.md would not mitigate a P0 or P1.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new tests only cover default survey metadata and an 80-unit floor case, so they would not catch either the heterogeneous-truth bug or the remaining high-floor auto-bracketing bug. The public docstrings for simulate_power(), simulate_mde(), and simulate_sample_size() also still omit the new survey_config parameter, so API docs/help will not surface the new constraints. References: tests/test_power.py:L2256-L2269, tests/test_power.py:L2342-L2378, diff_diff/power.py:L1785-L1814, diff_diff/power.py:L2450-L2484, diff_diff/power.py:L2675-L2695. Concrete fix: add regression tests for heterogeneous survey truth propagation and for a min_viable_n > 100 sample-size search branch; update the three wrapper docstrings to document survey_config, its incompatibilities, and the supported estimator set.

Path to Approval

  1. In simulate_power(), collect per-simulation dgp_truth["population_att"] and use it for truth/coverage/bias computations instead of the scalar input effect.
  2. In the survey search helpers, make the target effect correspond to actual population ATT: either calibrate the heterogeneous survey DGP so population_att matches treatment_effect, or reject heterogeneous-effect survey configs until that calibration exists.
  3. In simulate_sample_size(), change the no-n_range auto-bracketing branch so it never evaluates n < survey_config.min_viable_n and cannot return below that floor.
  4. Add regression tests for SurveyPowerConfig(heterogeneous_te_by_strata=True) and for a large-floor case such as SurveyPowerConfig(n_strata=10, psu_per_stratum=10) that exercises the power_lo < target branch.

…et, docstrings

- P0: reject heterogeneous_te_by_strata=True in survey power validation
  (DGP population_att diverges from input treatment_effect, making
  bias/coverage/RMSE metrics misleading)
- P1: fix simulate_sample_size auto-bracketing hi to respect abs_min
  (hi = max(2*lo, abs_min, 100) prevents probing below survey floor)
- P2: add survey_config parameter to simulate_power/mde/sample_size
  docstrings with constraints and supported estimators
- P2: add regression tests for heterogeneous_te rejection and
  large-floor (n_strata=10, psu_per_stratum=10, min_viable_n=200)
  auto-bracketing branch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 685c92609d243503b8b267d56bd950412bfc3746


Overall Assessment

Blocker

Executive Summary

Methodology

  • Severity P0. Impact: the survey-power wrappers still produce wrong truth-based diagnostics on a reachable survey DGP path. generate_survey_did_data() exposes te_covariate_interaction as unit-level treatment-effect heterogeneity and computes the realized weighted population_att in df.attrs["dgp_truth"], but the new survey-power path only rejects heterogeneous_te_by_strata, merges other passthrough DGP kwargs, and then continues to score coverage against the scalar effect, store true_effect=primary_effect, and derive bias/RMSE from that scalar. This silently misstates coverage/bias/RMSE whenever realized population_att != treatment_effect, and the same mismatch can distort the target effect scale in simulate_mde() / simulate_sample_size() because both delegate to simulate_power(). diff_diff/prep_dgp.py:L1299-L1303 diff_diff/prep_dgp.py:L1412-L1415 diff_diff/prep_dgp.py:L1570-L1579 diff_diff/prep_dgp.py:L1624-L1629 diff_diff/power.py:L1916-L1933 diff_diff/power.py:L2051-L2081 diff_diff/power.py:L2151-L2158 diff_diff/power.py:L2216-L2231 diff_diff/power.py:L1017-L1020 Concrete fix: either reject te_covariate_interaction != 0.0 (and any other survey DGP option that makes realized ATT non-scalar) when survey_config is set, or plumb per-simulation dgp_truth["population_att"] through coverage, true_effect, bias, RMSE, and the search wrappers.
  • Severity P3. Impact: no action required on the analytical deff approximation itself. The Registry explicitly documents Var(ATT) *= deff and n_total *= deff as the intended closed-form approximation, so this is a documented deviation rather than a methodology defect. docs/methodology/REGISTRY.md:L2257-L2258 Concrete fix: none.

Code Quality

No new findings in the changed code.

Performance

No new findings in the changed code.

Maintainability

No new findings in the changed code.

Tech Debt

No mitigating TODO.md entry covers the blocker above, so it remains unmitigated. TODO.md:L51-L97

Security

No new findings in the changed code.

Documentation/Tests

  • Severity P2. Impact: the new survey test block does not exercise the remaining non-scalar ATT path. It tests one safe passthrough (unit_fe_sd) and the explicit heterogeneous_te_by_strata rejection, so the blocker above would still pass this PR’s added tests unchanged. tests/test_power.py:L2298-L2309 tests/test_power.py:L2342-L2408 Concrete fix: add a regression using survey_config=SurveyPowerConfig(informative_sampling=True) plus data_generator_kwargs={"add_covariates": True, "te_covariate_interaction": 1.0} and assert either a clear ValueError or correct truth handling via dgp_truth["population_att"].
  • Local execution note: I could not run the tests in this environment because pytest and numpy are not installed.

Path to Approval

  1. Reject survey-power configurations that can make realized ATT differ from the scalar input unless truth propagation is implemented; at minimum, block data_generator_kwargs["te_covariate_interaction"] != 0.0 when survey_config is set.
  2. Or, if that path should remain supported, collect per-simulation dgp_truth["population_att"] in simulate_power() and compute coverage, true_effect, bias, and RMSE against that value instead of the scalar input effect.
  3. Apply the same truth semantics to simulate_mde() and simulate_sample_size() since both search by repeatedly calling simulate_power().
  4. Add a regression test for the reachable heterogeneous path above so the remaining blocker cannot recur silently.

…_config

P0: reject data_generator_kwargs with te_covariate_interaction != 0.0 or
covariate_effects when survey_config is active. These DGP params make the
realized population ATT diverge from the scalar treatment_effect input,
which would misstate bias/coverage/RMSE metrics in simulation power.

Add regression tests for both rejection paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a1d6199933eebde01f4362bea2754bf761604eb7


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: survey_config now exposes panel=False as a generic survey-DGP passthrough, but the wrapper does not reject that repeated-cross-section DGP for panel-only estimators.

Executive Summary

Methodology

Re-review note: the earlier P0 around survey DGP truth mismatch from covariate-interaction heterogeneity is resolved by the new explicit rejection path. diff_diff/power.py:L1916-L1945 tests/test_power.py:L2398-L2423

Code Quality

No new findings in the changed code.

Performance

No new findings in the changed code.

Maintainability

No new findings in the changed code.

Tech Debt

No existing TODO.md entry mitigates the P1/P2 items above, so they remain unmitigated. TODO.md:L51-L97

Security

No new findings in the changed code.

Documentation/Tests

  • Severity P2. Impact: the new survey tests cover supported-estimator smoke paths, unsupported-estimator rejection, heterogeneity rejection, metadata, and the sample-size floor fix, but they do not exercise panel=False under survey_config. The estimator/DGP compatibility gap above therefore has no regression coverage in this PR. tests/test_power.py:L2137-L2513 Concrete fix: add at least one negative regression for a panel-only estimator with data_generator_kwargs={"panel": False} and one positive repeated-cross-section case if support is intended.
  • Validation note: I could not run the tests locally because pytest is not installed in this environment.

Path to Approval

  1. Add estimator-specific validation so survey_config cannot generate panel=False repeated cross-sections for panel-only estimators; the error should name the actually supported repeated-cross-section subset.
  2. Update the survey_config registry/docstring text so panel is not advertised as a generic passthrough for all survey-supported estimators unless that matrix is truly supported.
  3. Add regression coverage for the rejected panel-only combination and, if applicable, an allowed repeated-cross-section combination such as CallawaySantAnna(panel=False).

…fects block

- P1: reject panel=False in data_generator_kwargs for panel-only estimators
  (TWFE, SA, Imputation, TwoStage, Stacked, Efficient); only CS supports
  repeated cross-sections under survey_config
- P2: remove over-broad covariate_effects rejection (it affects baseline
  outcomes and ICC calibration, not realized population_att)
- Add regression tests: panel=False rejected for TWFE, allowed for CS(panel=False)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a39a96f36030b81e32eda29525720f239f9cbf52


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: the prior panel=False survey-power blocker is only partially fixed. The PR now rejects panel-only estimator classes, but it still does not enforce panel-mode alignment between the survey DGP and CallawaySantAnna.panel.

Executive Summary

Methodology

Re-review status: the prior covariate_effects concern looks addressed, but the prior panel=False methodology/assumption issue is not fully closed.

  • Severity P1. Impact: survey_config exposes panel as a survey-DGP passthrough, but the validation only checks estimator class name. That leaves two unsupported combinations reachable: CallawaySantAnna() with data_generator_kwargs={"panel": False} and CallawaySantAnna(panel=False) with the default survey DGP (panel=True). The survey DGP explicitly says repeated cross-sections are for CallawaySantAnna(panel=False), and CallawaySantAnna.fit() uses mutually exclusive panel vs RCS contracts. In practice this means the new survey-power API still accepts inputs that should be rejected up front and instead fail later during fitting/simulation. Concrete fix: determine the effective survey DGP panel flag before simulation and require it to match CallawaySantAnna.panel exactly, or infer it automatically from the estimator and reject any override mismatch. diff_diff/power.py:L1946-L1954, diff_diff/power.py:L2068-L2142, diff_diff/prep_dgp.py:L1251-L1255, diff_diff/staggered.py:L1390-L1455, diff_diff/staggered.py:L1760-L1763
  • Severity P3. Impact: no action required on deff. The implementation matches the Registry’s documented approximation, so this is a documented methodology choice rather than a defect. Concrete fix: none. docs/methodology/REGISTRY.md:L2257-L2257, diff_diff/power.py:L1240-L1531
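A bidirectional panel-alignment guard along the lines suggested above might look like this sketch (estimator dispatch and kwarg names are assumptions; the real check would live in simulate_power() before the simulation loop):

```python
def check_panel_alignment(estimator_name, estimator_panel, dgp_kwargs):
    """Raise early if a CallawaySantAnna panel mode disagrees with the
    survey DGP's effective panel flag (assumed default: panel=True)."""
    if estimator_name != "CallawaySantAnna":
        return
    dgp_panel = dgp_kwargs.get("panel", True)
    if estimator_panel != dgp_panel:
        raise ValueError(
            f"CallawaySantAnna(panel={estimator_panel}) requires a survey "
            f"DGP with panel={estimator_panel}, got panel={dgp_panel}"
        )

check_panel_alignment("CallawaySantAnna", True, {})  # aligned: OK
try:
    # CS(panel=False) with the default panel DGP: rejected up front.
    check_panel_alignment("CallawaySantAnna", False, {})
except ValueError as e:
    print(e)
```

Because the check reads the effective flag (default included), it rejects both mismatch directions, including the case where data_generator_kwargs is empty.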

Code Quality

No new findings in the changed code beyond the panel-alignment validation gap above.

Performance

No new findings in the changed code.

Maintainability

No new findings in the changed code.

Tech Debt

No TODO.md entry under “Tech Debt from Code Reviews” covers the panel-alignment or docs/test gap above, so they remain unmitigated. TODO.md:L51-L97

Security

No new findings in the changed code.

Documentation/Tests

  • Severity P2. Impact: the new survey tests assert rejection for TwoWayFixedEffects() with panel=False and a positive case for CallawaySantAnna(panel=False) with panel=False, but they do not cover the two misaligned CallawaySantAnna configurations that remain reachable. The new Registry note also still presents panel as a generic survey-DGP passthrough, instead of documenting the required coupling between panel=False and CallawaySantAnna(panel=False). Concrete fix: add explicit negative tests for both misaligned CS combinations, and tighten the survey-power docs to say that repeated-cross-section survey power is only supported for CallawaySantAnna(panel=False) with a matching panel=False DGP (unless the code is changed to infer that automatically). tests/test_power.py:L2425-L2448, docs/methodology/REGISTRY.md:L2258-L2258
  • Validation note: I could not run the test suite in this environment because the local Python environment is missing numpy, so this is a source-only review.

Path to Approval

  1. Add upfront validation in simulate_power() so the survey DGP’s effective panel mode and CallawaySantAnna.panel cannot disagree. Reject both mismatch directions before entering the simulation loop, or infer the DGP mode directly from the estimator and reject conflicting overrides.
  2. Add regression tests for:
    1. CallawaySantAnna() with survey_config and data_generator_kwargs={"panel": False}.
    2. CallawaySantAnna(panel=False) with survey_config and no panel=False DGP override.
  3. Update the survey_config Registry/docstring text so panel is documented as a constrained estimator/DGP interaction, not a generic passthrough for all survey-supported configurations.

P1: validate that CallawaySantAnna.panel matches the survey DGP's panel
flag in both directions:
- CS(panel=True) + DGP panel=False -> rejected
- CS(panel=False) + default DGP (panel=True) -> rejected
- Panel check now runs even with empty data_generator_kwargs

Add regression tests for both misaligned CS combinations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 904a4d77286b1be2df717b561811c4e950d8f73f


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: Severity P1 [Newly identified]. survey_config is documented as automatically injecting the survey design into estimator fit(), but the injected design is then blindly overwritten by estimator_kwargs. That lets callers silently turn off or weaken survey-aware inference while still getting survey-DGP metadata back in the result object. diff_diff/power.py:L61-L64, diff_diff/power.py:L150-L160, diff_diff/power.py:L1815-L1821, diff_diff/power.py:L2151-L2158, diff_diff/power.py:L2265-L2267, docs/methodology/REGISTRY.md:L2258-L2258

Executive Summary

  • Re-review status: the prior panel-alignment blocker is fixed. The code now rejects both CallawaySantAnna() + repeated-cross-section DGP and CallawaySantAnna(panel=False) + default panel DGP, and tests cover both mismatch directions. diff_diff/power.py:L1947-L1970, tests/test_power.py:L2450-L2471
  • The analytical deff change is methodologically acceptable. The Registry explicitly documents the multiplicative approximation, and the code applies it consistently through variance, MDE, sample-size, and convenience wrappers. docs/methodology/REGISTRY.md:L2257-L2257, diff_diff/power.py:L1240-L1537, diff_diff/power.py:L2964-L3110
  • Severity P1 [Newly identified]: survey_config can be silently desynchronized from survey inference because estimator_kwargs overwrites the injected survey_design. Passing survey_design=None or a weights-only design would change the SE/power method away from the documented survey-aware workflow with no warning. diff_diff/power.py:L1815-L1821, diff_diff/power.py:L2151-L2158
  • Severity P2: SurveyPowerConfig does not mirror the survey DGP’s own conflict validation for icc vs non-default psu_re_sd and weight_cv vs non-default weight_variation, so contradictory public configs fail late during simulation instead of at config construction. diff_diff/power.py:L127-L148, diff_diff/prep_dgp.py:L1369-L1385
  • The new tests close the earlier panel mismatch gap, but the Registry note still describes panel as a generic passthrough and does not explicitly state the CallawaySantAnna(panel=False) coupling. That is documentation-only now. docs/methodology/REGISTRY.md:L2258-L2258, tests/test_power.py:L2425-L2471
  • Validation note: this is a source-only review. I could not run pytest here because the environment does not have the pytest module installed.

Methodology

  • Severity P1 [Newly identified]. Impact: the documented survey-aware method is “survey DGP + injected SurveyDesign at fit time,” but the implementation immediately does fit_kwargs.update(est_kwargs). That means callers can silently replace the full survey design with None or a partial design, changing variance/SE computation and therefore power while the result object still carries survey_config and survey DGP diagnostics. Concrete fix: when survey_config is set, reject estimator_kwargs["survey_design"] outright and require custom survey design overrides to go through SurveyPowerConfig.survey_design. diff_diff/power.py:L61-L64, diff_diff/power.py:L95-L98, diff_diff/power.py:L1815-L1821, diff_diff/power.py:L2151-L2158, diff_diff/power.py:L2265-L2267
  • Severity P3. Impact: no action required. The closed-form deff approximation is documented in the Methodology Registry, so the analytical variance/sample-size inflation is a documented deviation rather than a defect. Concrete fix: none. docs/methodology/REGISTRY.md:L2257-L2257, diff_diff/power.py:L1240-L1537
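The proposed guard against overwriting the injected design could be sketched as follows, a simplified stand-in for the actual fit-kwargs merge (function and key names assumed):

```python
def merge_fit_kwargs(injected, estimator_kwargs, survey_config_active):
    """Merge caller estimator_kwargs over injected fit kwargs, but refuse
    to let callers silently replace the injected survey design."""
    if survey_config_active and "survey_design" in estimator_kwargs:
        raise ValueError(
            "survey_design cannot be passed via estimator_kwargs when "
            "survey_config is set; use SurveyPowerConfig.survey_design"
        )
    merged = dict(injected)
    merged.update(estimator_kwargs)  # safe: guarded key already rejected
    return merged

# Other kwargs still pass through over the injected design.
print(merge_fit_kwargs({"survey_design": "design"}, {"alpha": 0.05}, True))
```

Without the guard, the blind update() lets survey_design=None through, turning off survey-aware inference while the result object still reports survey metadata, which is the silent desynchronization the finding describes.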

Code Quality

  • Severity P2. Impact: SurveyPowerConfig.__post_init__() validates ranges, but not the underlying DGP’s mutually exclusive settings. SurveyPowerConfig(icc=..., psu_re_sd=...) and SurveyPowerConfig(weight_cv=..., weight_variation=...) are accepted even though generate_survey_did_data() rejects them, so the public wrapper validates less strictly than the function it wraps. Concrete fix: copy the two conflict checks from generate_survey_did_data() into SurveyPowerConfig.__post_init__() and add direct validation tests. diff_diff/power.py:L127-L148, diff_diff/prep_dgp.py:L1369-L1385, tests/test_power.py:L2532-L2550
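The suggested __post_init__ conflict checks might look like this simplified dataclass sketch. Field names mirror the review's description; defaults and error wording are assumptions, not the actual SurveyPowerConfig:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SurveyPowerConfigSketch:
    icc: Optional[float] = None
    psu_re_sd: Optional[float] = None
    weight_cv: Optional[float] = None
    weight_variation: Optional[float] = None

    def __post_init__(self):
        # Fail fast on contradictory specs at construction time,
        # mirroring the checks the underlying DGP already performs,
        # instead of failing deep inside the simulation loop.
        if self.icc is not None and self.psu_re_sd is not None:
            raise ValueError("specify icc or psu_re_sd, not both")
        if self.weight_cv is not None and self.weight_variation is not None:
            raise ValueError("specify weight_cv or weight_variation, not both")

SurveyPowerConfigSketch(icc=0.05)  # OK: one clustering spec
try:
    SurveyPowerConfigSketch(icc=0.05, psu_re_sd=0.3)  # contradictory
except ValueError as e:
    print(e)
```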

Performance

  • No findings in the changed code.

Maintainability

  • No findings in the changed code.

Tech Debt

  • No existing TODO.md entry covers the P1/P2 items above, so they remain unmitigated for assessment purposes. TODO.md:L51-L97

Security

  • No findings in the changed code.

Documentation/Tests

  • Severity P3. Impact: the Registry note now documents survey_config, but it still describes panel as a generic passthrough instead of stating that panel=False is only supported with CallawaySantAnna(panel=False) and a matching DGP override. Since the code now rejects mismatches early, this is informational rather than blocking. Concrete fix: add one sentence to the Registry note and simulate_power() docstring making that coupling explicit. docs/methodology/REGISTRY.md:L2258-L2258, diff_diff/power.py:L1815-L1821, diff_diff/power.py:L1947-L1970, tests/test_power.py:L2450-L2471
  • Validation note: I could not run python -m pytest tests/test_power.py -k 'SurveyPower or deff' -q because pytest is not installed in this environment.

Path to Approval

  1. When survey_config is set, reject estimator_kwargs containing survey_design instead of allowing it to overwrite the injected survey design.
  2. Add regression tests showing that simulate_power(..., survey_config=..., estimator_kwargs={"survey_design": None}) and simulate_power(..., survey_config=..., estimator_kwargs={"survey_design": SurveyDesign(weights="weight")}) both fail fast with a clear ValueError.

… conflicts, docs

- P1: reject estimator_kwargs["survey_design"] when survey_config is set
  (prevents silently overwriting injected design; use SurveyPowerConfig.survey_design)
- P2: add icc/psu_re_sd and weight_cv/weight_variation mutual-exclusion
  validation to SurveyPowerConfig.__post_init__ (mirrors DGP validation)
- P3: update REGISTRY.md survey_config note to document panel=False coupling
  with CallawaySantAnna and estimator_kwargs survey_design restriction
- Add regression tests for all three guards

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ce910c7fb9c7edf9d59b6157170596429f34155d


Overall Assessment

Looks good

No unmitigated P0/P1 findings remain. The prior re-review blockers are addressed: simulate_power() now rejects estimator_kwargs["survey_design"] when survey_config is active, and the repeated-cross-section CallawaySantAnna(panel=False) coupling is now enforced in both mismatch directions and covered by tests. diff_diff/power.py:L1937-L1944, diff_diff/power.py:L1966-L1989, tests/test_power.py:L2450-L2471, tests/test_power.py:L2560-L2570

Executive Summary

Methodology

  • Severity P3. Impact: no action required. The analytical deff change is explicitly documented in the Methodology Registry as a multiplicative approximation, and the implementation applies it consistently in _compute_variance(), _compute_required_n(), and the public wrappers. Concrete fix: none. docs/methodology/REGISTRY.md:L2257-L2258, diff_diff/power.py:L1250-L1547
  • No other methodology findings in the changed code.
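For concreteness, the documented closed-form relationship can be stated in a few lines, assuming the Kish definition DEFF = 1 + CV(w)^2 and purely multiplicative inflation. This is standalone math matching the Registry description, not the library's _compute_variance():

```python
import math

def kish_deff(weights):
    """Kish (1965) design effect from sampling weights: 1 + CV(w)^2."""
    n = len(weights)
    mean_w = sum(weights) / n
    var_w = sum((w - mean_w) ** 2 for w in weights) / n
    return 1.0 + var_w / mean_w**2

def inflate(variance, n_total, deff):
    """Multiplicative inflation: Var(ATT) *= deff and n_total *= deff,
    per the documented closed-form approximation."""
    return variance * deff, math.ceil(n_total * deff)
```

With equal weights DEFF is exactly 1, so the survey formulas collapse to the standard Bloom-style ones.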

Code Quality

  • Severity P2. Impact: SurveyPowerConfig.__post_init__() still does not mirror the full validation contract of generate_survey_did_data(). In particular, the wrapper omits the DGP’s np.isfinite(weight_cv) and np.isfinite(psu_period_factor) and psu_period_factor >= 0 checks, so malformed public configs fail late instead of at construction time. Concrete fix: add the missing finiteness/non-negativity checks directly to SurveyPowerConfig.__post_init__() and cover them with constructor-level tests. diff_diff/power.py:L127-L158, diff_diff/prep_dgp.py:L1359-L1385, tests/test_power.py:L2532-L2558

Performance

  • No findings in the changed code.

Maintainability

  • No findings in the changed code.

Tech Debt

  • No findings in the changed code.

Security

  • No findings in the changed code.

Documentation/Tests

  • Severity P2. Impact: the new tests cover the resolved re-review regressions well, but they still miss the remaining malformed-input edge cases in the new analytical API. _validate_deff() only rejects deff <= 0, so non-finite deff values can slip through despite the documented deff > 0 contract; the current tests exercise only the warning path for deff < 1 and the deff=0 error case. Concrete fix: make _validate_deff() reject non-finite values explicitly and add regressions for deff=np.nan plus the missing SurveyPowerConfig finiteness cases noted above. diff_diff/power.py:L1230-L1239, docs/methodology/REGISTRY.md:L2257-L2257, tests/test_power.py:L2518-L2558

Validation note: I could not run python -m pytest tests/test_power.py -k 'SurveyPower or deff' -q in this environment because both pytest and numpy are unavailable.

- Reject non-finite deff (NaN, inf) in _validate_deff()
- Add weight_cv finiteness and psu_period_factor non-negativity checks
  to SurveyPowerConfig.__post_init__ (mirrors DGP validation)
- Add regression tests for all new edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4322648a3878cc7f18a7e223e43a5b86d5aded0c


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

  • Severity P1. Impact: the new survey convenience API cannot produce a methodology-matching survey DGP for some already-supported staggered configurations. simulate_power() hard-codes cohort_periods=[treatment_period + 1] and never_treated_frac=1 - treatment_fraction, while _SURVEY_CONFIG_KEYS forbids users from overriding those via data_generator_kwargs. But the existing compatibility logic still treats control_group="not_yet_treated" and clean_control="strict" as configurations that need multi-cohort overrides, and the Registry explicitly documents those overrides as the way to match the intended design. Under survey_config, that documented fix is unavailable, so the new path can return power for a different control-group composition than the estimator setting implies. diff_diff/power.py:L341-L372, diff_diff/power.py:L502-L520, diff_diff/power.py:L555-L613, diff_diff/power.py:L1951-L1957, docs/methodology/REGISTRY.md:L2255-L2258, tests/test_power.py:L1469-L1578
    Concrete fix: either allow cohort_periods / never_treated_frac passthrough when survey_config is active, or reject survey_config up front for estimator settings that require a multi-cohort survey DGP (control_group="not_yet_treated", clean_control="strict"). If the restriction is intentional, document it explicitly in the survey-power Registry note instead of leaving the current warning/fix text in place.

  • Severity P3. Impact: no action required. The analytical deff implementation matches the documented Registry deviation: Var(ATT) *= deff, n_total *= deff, and deff < 1 warns rather than errors. diff_diff/power.py:L1235-L1546, docs/methodology/REGISTRY.md:L2257-L2258
    Concrete fix: none.

Code Quality

  • Severity P1. Impact: simulate_sample_size() has an unhandled search-critical parameter interaction on the new survey path. The PR’s Registry note says strata_sizes is an allowed passthrough under survey_config, but generate_survey_did_data() requires sum(strata_sizes) == n_units. Because simulate_sample_size() changes n_units during bracketing/bisection, any fixed strata_sizes list is only valid at one search point. The search will therefore fail late inside the simulation loop instead of rejecting the incompatible input up front. This is the same class of bug the file already handles for TripleDifference/n_per_cell. diff_diff/power.py:L2806-L2813, diff_diff/power.py:L2823-L2852, diff_diff/prep_dgp.py:L1387-L1401, docs/methodology/REGISTRY.md:L2258-L2258
    Concrete fix: add the same style of upfront guard used for n_per_cell: reject strata_sizes in simulate_sample_size() when the built-in survey DGP is active, unless you also implement a way to regenerate/rescale strata_sizes for each candidate n_units.

Performance

  • No findings in the changed code.

Maintainability

  • No separate findings beyond the survey_config contract issues above.

Tech Debt

  • No existing TODO.md entry mitigates the P1 issues above; they remain untracked. TODO.md:L51-L77

Security

  • No findings in the changed code.

Documentation/Tests

  • Severity P2. Impact: the new tests exercise survey happy paths, unsupported estimators, and validation basics, but they do not cover the two most fragile new interactions: survey_config with staggered estimators that require a multi-cohort control design, and simulate_sample_size() with allowed survey passthroughs that depend on n_units such as strata_sizes. That leaves both P1 behaviors above unpinned. tests/test_power.py:L1469-L1578, tests/test_power.py:L2142-L2588
    Concrete fix: add explicit regression tests for the chosen behavior: either successful multi-cohort survey overrides or clear ValueErrors for control_group="not_yet_treated" / clean_control="strict" under survey_config, plus a simulate_sample_size(..., data_generator_kwargs={"strata_sizes": ...}) test that asserts upfront rejection.

  • Validation note: I could not execute the test suite in this environment because importing the package fails immediately with ModuleNotFoundError: numpy.

Path to Approval

  1. Make survey_config either support multi-cohort survey DGP overrides (cohort_periods, never_treated_frac) or explicitly reject estimator settings that require them, instead of warning and then blocking the documented fix.
  2. Add an upfront simulate_sample_size() guard for survey strata_sizes (or implement per-candidate regeneration) so search-critical DGP inputs cannot fail only after bisection starts.
  3. Add regressions for survey_config + CallawaySantAnna(control_group="not_yet_treated"), survey_config + StackedDiD(clean_control="strict"), and simulate_sample_size() + survey strata_sizes, and update the survey-power Registry note to match the final behavior.

- P1: reject control_group='not_yet_treated' and clean_control='strict'
  with survey_config (require multi-cohort DGP that survey path doesn't support)
- P1: reject strata_sizes in data_generator_kwargs for simulate_sample_size
  (sum(strata_sizes) == n_units but n_units varies during bisection)
- Update REGISTRY.md survey-power note to document these restrictions
- Add regression tests for all three rejection paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c5f9b3c589a6f56bb3ce9405cd62a35e02139037


Overall Assessment

⚠️ Needs changes — the previous P1s from the last review are addressed, but one newly identified P1 remains.

Executive Summary

  • Affected methods: analytical power formulas in PowerAnalysis / compute_*, and the new survey-aware simulation path in simulate_power, simulate_mde, and simulate_sample_size.
  • The prior re-review blockers are fixed: survey_config now rejects control_group="not_yet_treated" / clean_control="strict", and simulate_sample_size() now rejects fixed strata_sizes, with matching Registry notes and tests.
  • The analytical deff implementation is aligned with the documented Registry note: variance and required sample size are inflated multiplicatively, wrappers propagate deff, and non-finite values are rejected.
  • Severity P1 [Newly identified]. survey_config still accepts EfficientDiD(control_group="last_cohort"), even though the hard-coded single-cohort survey DGP cannot satisfy that control-group design; this fails late after the simulation loop instead of rejecting up front.
  • Severity P2. SurveyPowerConfig validation is still incomplete for some scalar inputs (psu_re_sd, non-finite fpc_per_stratum), so invalid configs can slip past construction-time validation and only fail deeper in generation / survey resolution.

Methodology

  • Severity P1 [Newly identified]. Impact: the new survey-power path still has an undocumented control-group mismatch for EfficientDiD(control_group="last_cohort"). The survey adapter explicitly rejects other multi-cohort-only settings, but not this one; meanwhile EfficientDiD itself requires at least two treatment cohorts for last_cohort, and the survey DGP is hard-coded to a single cohort while blocking cohort_periods overrides. That leaves an advertised survey-aware power path that cannot represent the requested identification strategy and will only surface as “all simulations failed” after exhausting the loop. diff_diff/power.py:L1995-L2011, diff_diff/efficient_did.py:L456-L467, docs/methodology/REGISTRY.md:L794-L795, docs/methodology/REGISTRY.md:L2258-L2258
    Concrete fix: reject control_group="last_cohort" up front when survey_config is set, or support multi-cohort survey overrides on that path; then document the restriction in the survey-power Registry note.

  • Severity P3. Impact: no action required. The analytical deff extension is methodologically consistent with the Registry note: Var(ATT) *= deff, n_total *= deff, and convenience wrappers propagate deff through the public API. diff_diff/power.py:L1235-L1759, diff_diff/power.py:L3017-L3163, docs/methodology/REGISTRY.md:L2257-L2257
    Concrete fix: none.

Code Quality

  • Severity P2. Impact: SurveyPowerConfig validates some survey scalars but still does not reject invalid psu_re_sd or non-finite fpc_per_stratum at construction time. Those values are passed through into the survey DGP and survey-design resolution, so bad configs can fail only after simulation starts instead of raising an immediate ValueError. diff_diff/power.py:L127-L163, diff_diff/prep_dgp.py:L1325-L1331, diff_diff/prep_dgp.py:L1480-L1487, diff_diff/survey.py:L324-L332
    Concrete fix: add upfront checks in SurveyPowerConfig.__post_init__ for np.isfinite(self.psu_re_sd) and self.psu_re_sd >= 0 and np.isfinite(self.fpc_per_stratum).

Performance

  • No findings in the changed code.

Maintainability

  • No separate findings beyond the validation gaps already noted.

Tech Debt

  • Severity P3. Impact: no mitigating TODO.md entry applies to the P1 above. The nearby EfficientDiD last_cohort TODO is about anticipation trimming, not survey-power compatibility, so it does not downgrade review severity. TODO.md:L62-L65
    Concrete fix: none for mitigation purposes; this needs a code/doc change.

Security

  • No findings in the changed code.

Documentation/Tests

  • Severity P2. Impact: the new survey tests cover default EfficientDiD() and the previously raised rejection paths, but they do not pin down the newly identified EfficientDiD(control_group="last_cohort") incompatibility, and they also do not cover the missing SurveyPowerConfig scalar-validation cases above. The new survey-power Registry note similarly presents EfficientDiD as supported without calling out this unsupported control-group variant. tests/test_power.py:L2201-L2215, tests/test_power.py:L2542-L2580, tests/test_power.py:L2594-L2631, docs/methodology/REGISTRY.md:L2258-L2258
    Concrete fix: add explicit regressions for simulate_power(EfficientDiD(control_group="last_cohort"), survey_config=...) and for invalid SurveyPowerConfig scalars such as psu_re_sd=-1/np.nan and fpc_per_stratum=np.nan/np.inf, and update the survey-power note accordingly.

  • Validation note: I could not run the package tests in this environment because numpy, pandas, and scipy are unavailable. I did syntax-compile the changed Python files successfully.

Path to Approval

  1. Add an upfront survey_config rejection for EfficientDiD(control_group="last_cohort") in simulate_power() so the unsupported single-cohort survey design fails immediately and clearly.
  2. Update the survey-power Registry note to list control_group="last_cohort" alongside the other survey-path restrictions.
  3. Add regression tests for the last_cohort rejection and for the missing SurveyPowerConfig scalar-validation cases.

- P1: reject control_group='last_cohort' (EfficientDiD) with survey_config
  (needs multi-cohort DGP, same as not_yet_treated)
- P2: add psu_re_sd and fpc_per_stratum finiteness validation to
  SurveyPowerConfig.__post_init__
- Update REGISTRY.md to list last_cohort alongside other restrictions
- Add regression tests for last_cohort rejection and scalar validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b91a48549f932aa8cf8be02b80be7c3b07626a55


Overall Assessment

Needs changes. The prior re-review blockers around survey-path control-group rejection and SurveyPowerConfig scalar validation appear resolved, and the analytical deff addition matches the documented Registry note. One unmitigated P1 remains in the new survey sample-size path.

Executive Summary

  • Affected methods: analytical PowerAnalysis.power/mde/sample_size and compute_power/compute_mde/compute_sample_size, plus simulate_power, simulate_mde, and simulate_sample_size with survey_config.
  • The previous P1s look addressed: the survey path now rejects control_group="last_cohort" / clean_control="strict" up front, the Registry note documents those restrictions, and SurveyPowerConfig now validates psu_re_sd and fpc_per_stratum.
  • The closed-form deff implementation is methodologically consistent with the new Registry note: variance and required sample size are inflated multiplicatively, and the public wrappers propagate deff.
  • Severity P1: explicit n_range still bypasses the new survey minimum-N floor because _snap_n() ignores its floor argument for non-DDD searches, so simulate_sample_size() can return infeasible survey sample sizes below survey_config.min_viable_n.
  • Severity P2: the new tests cover auto-bracketing floor enforcement, but not the explicit n_range branch where the regression above lives.

Methodology

  • Severity P1. Impact: simulate_sample_size(..., survey_config=..., n_range=(lo, hi)) can silently evaluate and return n below the documented survey viability floor n_strata * psu_per_stratum * 2. _snap_n() returns n unchanged whenever grid_step == 1, so the explicit-bracket branch never applies abs_min for survey-enabled searches. That violates the new Registry contract for survey-aware power analysis and can produce infeasible sample-size recommendations. Concrete fix: make _snap_n() honor floor even when grid_step == 1, and clamp explicit n_range against max(min_n, survey_config.min_viable_n) before any power evaluation. Location: diff_diff/power.py:L2836-L2842, diff_diff/power.py:L2886-L2895, docs/methodology/REGISTRY.md:L2258
  • Severity P3. Impact: no action required. The analytical deff extension is consistent with the documented Registry note: Var(ATT) *= deff, n_total *= deff, and the convenience wrappers propagate deff through the public API. Concrete fix: none. Location: diff_diff/power.py:L1238-L1556, diff_diff/power.py:L3021-L3163, docs/methodology/REGISTRY.md:L2257

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. No TODO.md entry mitigates the P1 above, so it remains blocking for approval.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new survey sample-size tests only cover the auto-bracketing path and never exercise explicit n_range with survey_config, so the below-floor regression is unpinned. Concrete fix: add a regression such as simulate_sample_size(..., survey_config=SurveyPowerConfig(n_strata=5, psu_per_stratum=8), n_range=(10, 20)) and assert the function either clamps to >= 80 or raises a clear ValueError for an infeasible explicit bracket; add a mixed bracket like (10, 200) as well to verify both endpoints are clamped before evaluation. Location: tests/test_power.py:L2342-L2396
  • Validation note: I could not execute the test suite in this environment because numpy and the scientific Python stack are unavailable.

Path to Approval

  1. Make the explicit-bracket path in simulate_sample_size() enforce the survey floor for non-DDD searches by applying max(min_n, abs_min) even when grid_step == 1.
  2. Add explicit-bracket regression tests for survey_config so n_range cannot return or probe n < survey_config.min_viable_n again.
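Under the clamping variant of step 1 (as opposed to raising on any sub-floor endpoint), the bracket handling could be sketched as follows. The helper name is hypothetical, and min_viable_n = n_strata * psu_per_stratum * 2 = 80 in the review's example:

```python
def clamp_survey_bracket(n_range, min_viable_n):
    """Clamp an explicit (lo, hi) bracket against the survey viability
    floor before any power evaluation; reject brackets that are
    entirely infeasible."""
    lo, hi = n_range
    if hi < min_viable_n:
        raise ValueError(
            f"n_range {n_range} lies entirely below the survey floor "
            f"{min_viable_n}; no feasible n exists in the bracket."
        )
    return (max(lo, min_viable_n), hi)
```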

- P1: fix _snap_n() to honor floor even when grid_step==1 (non-DDD),
  so explicit n_range with survey_config is clamped to min_viable_n
- Add regression test for n_range=(10, 200) with min_viable_n=80

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 87661d7943bdcc1b1e9db77bb4308f46891d8ce4


Overall Assessment

✅ Looks good.

The prior re-review blocker on survey simulate_sample_size(..., n_range=...) floor enforcement is fixed, and I did not find a new unmitigated P0/P1 in the changed diff.

Executive Summary

  • Affected methods: analytical PowerAnalysis.power(), mde(), sample_size() and compute_power(), compute_mde(), compute_sample_size(), plus simulation simulate_power(), simulate_mde(), and simulate_sample_size() via survey_config.
  • The previous sample-size blocker is fixed: _snap_n() now honors the survey floor even when grid_step == 1, and the explicit n_range branch clamps to survey_config.min_viable_n. See diff_diff/power.py:L2836-L2943 and tests/test_power.py:L2398-L2415.
  • The analytical deff addition is methodologically consistent with the Registry note: variance inflation and required-sample inflation are both multiplicative, and deff is kept distinct from panel rho. See diff_diff/power.py:L1238-L1633 and docs/methodology/REGISTRY.md:L2257-L2258.
  • The survey simulation path now correctly rejects unsupported estimators and single-cohort-incompatible settings (not_yet_treated, last_cohort, clean_control="strict"), which resolves the earlier control-group concern. See diff_diff/power.py:L1910-L2015.
  • I could not execute the tests locally because this environment does not have numpy or pytest.

Methodology

  • Severity P3 (informational). Impact: No action required. The new analytical deff behavior is a documented approximation, not an undocumented methodology deviation. _compute_variance() multiplies DiD variance by deff, _compute_required_n() multiplies required sample size by deff, and the Registry explicitly documents that approximation and its distinction from rho. Concrete fix: none. Locations: diff_diff/power.py:L1238-L1633, docs/methodology/REGISTRY.md:L2257-L2258.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity P3 (informational). Impact: No action required. The added tests now cover the previously missing explicit-n_range survey-floor regression, along with the supported/unsupported estimator matrix and the main survey_config/deff validation paths. Concrete fix: none. Locations: tests/test_power.py:L2141-L2673.
  • Validation note: I could not run the test suite in this environment because numpy and pytest are unavailable.

@igerber added the ready-for-ci label (Triggers CI test workflows) on Apr 12, 2026
@igerber merged commit 8ca80e0 into main on Apr 12, 2026
23 of 24 checks passed
@igerber deleted the survey-aware-power branch on April 12, 2026 at 21:50