
Add Phase 2 multi-horizon event study for dCDH estimator#294

Merged
igerber merged 11 commits into main from dcdh-phase2
Apr 12, 2026

Conversation

@igerber
Owner

igerber commented Apr 12, 2026

Summary

  • Add multi-horizon event study DID_l for l = 1..L_max to ChaisemartinDHaultfoeuille via the L_max parameter (ROADMAP items 2a-2h)
  • Per-group DID_{g,l} building block (Eq 3 of dynamic companion paper, NBER WP 29873)
  • Per-horizon cohort-recentered analytical SE (Web Appendix Section 3.7.3)
  • Dynamic placebos DID^{pl}_l with dual eligibility condition (Section 1.1)
  • Normalized estimator DID^n_l = DID_l / delta^D_l (Section 3.2)
  • Cost-benefit aggregate delta (Section 3.3, Lemma 4) - becomes overall_att when L_max > 1
  • Sup-t simultaneous confidence bands via multiplier bootstrap
  • plot_event_study() integration with <50% switcher warning
  • to_dataframe("event_study") and to_dataframe("normalized") output levels
  • Per-horizon bootstrap with SE/CI/p-value propagation to event_study_effects
  • R DIDmultiplegtDYN parity tests at multiple horizons (4 new scenarios)
  • L_max=None (default) preserves exact Phase 1 behavior (142 tests, 0 failures)
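The per-group building block named in the second bullet can be sketched in plain Python. Everything below (data layout, function name, toy panel) is hypothetical and only mirrors the switcher-minus-controls shape of Eq 3; it is not the library's implementation.

```python
def did_gl(y, first_switch, g, l):
    """Sketch of DID_{g,l}: switcher g's outcome change from its last
    pre-period F_g - 1 to F_g - 1 + l, minus the mean change of groups
    that have not yet switched by the outcome period. Hypothetical API."""
    f = first_switch[g]
    base, out = f - 1, f - 1 + l
    sw_change = y[g][out] - y[g][base]
    # Control pool: groups still untreated at the outcome period.
    ctrl = [y[c][out] - y[c][base]
            for c in y if c != g and first_switch[c] > out]
    if not ctrl:
        return None  # empty control pool: no estimate at this horizon
    return sw_change - sum(ctrl) / len(ctrl)

# Toy panel: periods 0..3; group "a" first switches at t = 2.
y = {"a": [1.0, 1.0, 2.0, 2.5], "b": [1.0, 1.1, 1.2, 1.3]}
first_switch = {"a": 2, "b": 99}  # "b" never switches
print(did_gl(y, first_switch, "a", 1))
print(did_gl(y, first_switch, "a", 2))
```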

Methodology references (required if estimator / math changes)

  • Method name(s): ChaisemartinDHaultfoeuille (DCDH) - multi-horizon DID_l, normalized DID^n_l, cost-benefit delta, dynamic placebos DID^{pl}_l
  • Paper / source link(s): de Chaisemartin, C. & D'Haultfoeuille, X. (2022, revised 2024). Difference-in-Differences Estimators of Intertemporal Treatment Effects. NBER Working Paper 29873. Equations 3, 5, 15; Lemma 4; Web Appendix Section 1.1, Section 3.7.3.
  • Any intentional deviations from the source (and why): Cell-count weighting (Phase 1 deviation, carried forward); period-based vs cohort-based stable controls at l=1 (Phase 1 deviation, carried forward); per-group DID_{g,l} path at l>=2 uses the paper's cohort-based controls naturally

Validation

  • Tests added/updated: tests/test_chaisemartin_dhaultfoeuille.py (+18 Phase 2 tests), tests/test_chaisemartin_dhaultfoeuille_parity.py (+4 multi-horizon R parity tests)
  • R parity: 9 scenarios (5 Phase 1 + 4 Phase 2), all passing against R DIDmultiplegtDYN v2.3.3
  • Golden values regenerated: benchmarks/data/dcdh_dynr_golden_values.json

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…uille

Implements ROADMAP items 2a-2h: multi-horizon DID_l via per-group
DID_{g,l} building block (Eq 3 of dynamic paper), per-horizon
analytical SE, dynamic placebos DID^{pl}_l, normalized DID^n_l,
cost-benefit aggregate delta, sup-t simultaneous confidence bands,
plot_event_study() integration, and R DIDmultiplegtDYN parity tests
at multiple horizons. L_max parameter controls multi-horizon mode;
L_max=None preserves exact Phase 1 behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⛔ Blocker

Executive Summary

  • Affected methodology: ChaisemartinDHaultfoeuille Phase 2 multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized effects DID^n_l, cost-benefit aggregate delta, and multiplier-bootstrap sup-t bands.
  • The new delta implementation does not match the cumulative-dose formula documented in the Methodology Registry, so overall_att becomes the wrong estimand when L_max > 1.
  • The new dynamic placebo helper reverses the pre-trend difference relative to the existing placebo convention, which likely negates DID^{pl}_l.
  • Sup-t bands are calibrated on horizons 2..L_max but then attached to l=1 as well, so the reported “simultaneous” band is miscalibrated.
  • The Phase 2 public surface is internally inconsistent at l=1: event_study_effects[1], normalized_effects[1], and top-level overall_att/labels can refer to different estimands, and the tests currently do not pin the new methodology-sensitive outputs tightly enough to catch that.

Methodology

Code Quality

Performance

No findings.

Maintainability

No additional findings beyond the public-surface contract issue above.

Tech Debt

No relevant TODO.md entry mitigates the P0/P1 issues above, so they remain unmitigated. Location: TODO.md.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Reimplement delta weights from cumulative delta^D_{g,l} rather than one-period dose gaps, and add a binary-treatment regression test with expected l * N_l weighting.
  2. Flip the dynamic placebo difference to reference-minus-preperiod, then extend the R fixture assertions to cover placebo outputs.
  3. Recompute the sup-t critical value over the exact set of horizons that receive simultaneous bands, or stop attaching a band to l=1.
  4. Make the L_max>=2 public contract internally consistent: event_study_effects[1], normalized_effects[1], bootstrap inference, and user-facing labels must all refer to the same estimand with matching SEs.
  5. Add explicit A11/control-availability warnings for empty multi-horizon control pools, especially under terminal missingness.

…onsistency

- Fix cost-benefit delta to use cumulative dose (sum_{k=0}^{l-1} |D_{g,F_g+k} - D_{g,1}|)
  instead of one-period dose; binary weights now proportional to l * N_l
- Flip dynamic placebo sign to ref-minus-preperiod (Y_{ref} - Y_{backward}),
  matching the Phase 1 convention
- Include l=1 in sup-t bootstrap calibration so bands are truly simultaneous
  over all horizons 1..L_max
- Use per-group DID_{g,l} path for event_study_effects[1] when L_max >= 2,
  making all horizons use a consistent estimand
- Label overall_att as "delta" in summary/to_dataframe when L_max > 1
- Add A11 control-availability warnings for multi-horizon empty control pools

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7ed7fde5784ba56c3fa096eb42db8a0106229cf6


Overall Assessment

⛔ Blocker

Executive Summary

  • Re-review status: the earlier sup-t calibration blocker and the main Phase 2 l=1 estimand-propagation issue appear resolved in the changed code.
  • P0 remains: dynamic placebos are still sign-reversed relative to the cited Phase 2 placebo definition, so the new negative-horizon outputs report the wrong direction.
  • P0 remains: cost-benefit delta still uses the wrong horizon weights, and overall_att is overwritten with that aggregate when L_max > 1.
  • P1 remains: multi-horizon placebos still silently zero-retain empty control pools without the warning contract used elsewhere in dCDH.
  • The remaining non-blocking issue is public-surface labeling: some result helpers still call the top-level estimand DID_M even when Phase 2 repurposes it to hold delta.
  • Static review only: I could not run the Python test suite in this sandbox because the environment is missing numpy.

Methodology

  • Severity: P0. Impact: _compute_multi_horizon_placebos() still computes the placebo change as Y_ref - Y_backward for both switchers and controls, but the dynamic companion paper defines DID^{pl}_{g,l} using the backward-minus-reference change. That silently flips the sign of every reported DID^{pl}_l, and the registry does not document this as a deviation. Concrete fix: reverse both subtraction orders in _compute_multi_horizon_placebos() and extend parity coverage to assert the R Placebos fixture, not just Effects. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L2430-L2450, docs/methodology/REGISTRY.md:L518-L538.

  • Severity: P0. Impact: _compute_cost_benefit_delta() still weights each horizon by cumulative dose up to l, not by Lemma 4’s horizon weight. Under a binary joiners-only design, Lemma 4 implies weights proportional to N_l; the current implementation makes them proportional to l * N_l, so longer horizons are overweighted. Because fit() overwrites overall_att with this delta when L_max > 1, this is a silent main-result error, not just a side-metric issue. The implementation also contradicts its own docstring. Concrete fix: compute w_l from Lemma 4’s per-horizon increment formula and add a binary-treatment regression that distinguishes N_l from l * N_l. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L2526-L2652, diff_diff/chaisemartin_dhaultfoeuille.py:L1483-L1500.

  • Severity: P1. Impact: the new multi-horizon DID path now surfaces control-availability warnings, but the new placebo path still does not. _compute_multi_horizon_placebos() zeroes pl_g_l[g] when ctrl_pool.size == 0 and retains that group in N_pl_l, yet fit() never emits the consolidated warning contract used by _compute_placebo() and documented in the registry’s “no silent failures” / placebo-warning notes. On ragged or terminally missing panels, that can silently bias DID^{pl}_l toward zero. Concrete fix: accumulate placebo control-availability warnings inside _compute_multi_horizon_placebos() and surface them from fit() the same way Phase 1 placebo warnings and the new multi-horizon DID warnings are handled. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L2434-L2446, diff_diff/chaisemartin_dhaultfoeuille.py:L1140-L1152, diff_diff/chaisemartin_dhaultfoeuille.py:L1885-L1983, docs/methodology/REGISTRY.md:L586-L618.

Re-review note: the prior sup-t calibration finding appears fixed. The bootstrap now calibrates over horizons 1..L_max and applies bands to that same event-study set. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1321-L1372, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L254-L301, diff_diff/chaisemartin_dhaultfoeuille.py:L1471-L1481.

Code Quality

Performance

No findings.

Maintainability

No additional findings beyond the result-surface contract issue above.

Tech Debt

No mitigating entry in TODO.md:L51-L97 covers the P0/P1 issues above, so they remain unmitigated under the project’s deferral rules.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Reverse the Phase 2 placebo differences in _compute_multi_horizon_placebos() and add parity assertions over Placebos from benchmarks/data/dcdh_dynr_golden_values.json.
  2. Reimplement _compute_cost_benefit_delta() using Lemma 4’s per-horizon weights and add a regression that distinguishes N_l from l * N_l under binary treatment.
  3. Add consolidated warning plumbing for zero-retained multi-horizon placebo control-pool failures, mirroring Phase 1 placebo warnings and the new multi-horizon DID warning path.
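The N_l versus l * N_l distinction in item 2 is easy to pin with a toy check. The helper below is a hypothetical sketch (all names and numbers invented) of the binary-case contract the review describes, where each horizon's weight is proportional to the number of switchers reaching it:

```python
def cost_benefit_delta(did_l, n_l):
    """Sketch: delta = sum_l w_l * DID_l with w_l proportional to N_l,
    the binary joiners-only reading of Lemma 4 per the review."""
    total = sum(n_l.values())
    return sum((n_l[l] / total) * did_l[l] for l in did_l)

did_l = {1: 0.9, 2: 1.3}   # per-horizon event-study effects (toy values)
n_l = {1: 10, 2: 6}        # switchers observed at each horizon

delta = cost_benefit_delta(did_l, n_l)
print(delta)  # N_l weighting

# The rejected l * N_l weighting overweights longer horizons:
bad_n = {l: l * n for l, n in n_l.items()}   # {1: 10, 2: 12}
print(cost_benefit_delta(did_l, bad_n))
```

A regression test that asserts the first value and not the second would catch exactly the weighting bug flagged above.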

…s, labels

- Flip placebo to paper convention (backward - ref) for R parity
- Revert cost-benefit delta to per-period dose (Lemma 4: w_l proportional
  to N_l for binary, not l * N_l)
- Add placebo control-availability warnings mirroring DID path
- Update __repr__ and joiners_leavers labels for delta when L_max > 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 32f7a5341941d886c1ee77b1363c86510224e0bd


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the prior placebo-sign, Lemma 4 weighting, and multi-horizon placebo-warning findings look addressed in the updated helper code and warning plumbing.
  • P1: when L_max >= 2, the PR repurposes overall_att to the cost-benefit aggregate delta but clears the entire overall_* inference surface to NaN, so the new default top-level estimand ships without SE/CI/p-value.
  • P1: dynamic placebo horizons still ship as point estimates only; negative-horizon inference is hard-coded to NaN, and the new placebo-bootstrap plumbing is never connected.
  • The new R fixtures already capture placebo SE/CI, but the parity suite only asserts positive-horizon point estimates, so the inference gaps are not currently pinned.
  • Static review only: I could not run the Python tests in this sandbox because numpy is unavailable.

Methodology

Code Quality

  • Severity: P2. Impact: summary() still prints the generic “p-value and CI are multiplier-bootstrap percentile inference” note whenever bootstrap_results exists, even in Phase 2 where the displayed overall row is delta with all inference fields set to NaN. That makes the human-readable output claim bootstrap inference for an estimand that currently has none. Concrete fix: gate the note on non-NaN overall_* inference or split the note so it only describes the event-study horizons that actually received bootstrap propagation. Location: diff_diff/chaisemartin_dhaultfoeuille_results.py:L543.

Performance

  • No findings.

Maintainability

  • No additional findings beyond the inference-surface contract issues above.

Tech Debt

  • No mitigating entry under TODO.md:L51 covers the new Phase 2 delta/placebo inference gaps, so they remain unmitigated for this review.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Implement and propagate inference for the Phase 2 cost-benefit delta surfaced in overall_att, or stop overloading overall_* until that inference exists.
  2. Implement dynamic-placebo inference end-to-end and populate placebo_event_study from it, or explicitly defer/remove the public negative-horizon inference surface and document that deferral in REGISTRY.md.
  3. Extend the R parity and unit tests to assert placebo effect/SE/CI and the L_max > 1 overall_* inference contract.
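One analytical route for item 1 is a delta-method SE built from the per-horizon SEs, treating horizons as independent (a simplifying assumption). All names here are illustrative, not the library's API:

```python
import math

def delta_method_se(weights, ses):
    # SE(delta) = sqrt(sum_l w_l^2 * SE(DID_l)^2), assuming the
    # per-horizon estimates are uncorrelated (simplifying assumption).
    return math.sqrt(sum(w * w * s * s for w, s in zip(weights, ses)))

w = [0.625, 0.375]    # horizon weights, summing to 1 (toy values)
se = [0.20, 0.40]     # per-horizon SEs of DID_l
print(delta_method_se(w, se))
```

In practice the horizon estimates share groups and are correlated, so this understates uncertainty unless the cross-horizon covariances are added or a joint bootstrap is used instead.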

…mary gate

- Compute cost-benefit delta SE from per-horizon SEs via delta method:
  SE(delta) = sqrt(sum w_l^2 * SE(DID_l)^2), giving overall_att
  non-NaN inference when L_max > 1
- Document placebo SE NaN as intentional Phase 2 deferral in REGISTRY.md
  (placebo IF computation deferred; point estimates meaningful for
  visual pre-trends; bootstrap plumbing exists but not wired)
- Gate summary() bootstrap note on non-NaN overall inference
- Remove unused import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 00efe07e9bd145776ebaa1d965d9fa6d05a9445c


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the earlier placebo-sign, Lemma 4 weighting-direction, and far-horizon warning issues appear addressed in the current diff.
  • P1: overall_att now becomes Phase 2 delta, but the new SE/CI/p-value path only runs when bootstrap results exist, so the default analytical path still leaves the new top-level estimand with NaN inference.
  • P1: the new Phase 2 DID_l / DID^{pl}_l / DID^n_l / delta code averages controls and switchers equally instead of using the paper’s outcome-period N_{g,t} weights, so unequal cell sizes silently change the estimand with no explicit Phase 2 Note/Deviation label.
  • P1: the sup-t bands are built from independently bootstrapped horizons rather than one shared bootstrap replicate across the full event-study vector, so the simultaneous bands are not a valid joint multiplier-bootstrap band.
  • P3: the dynamic-placebo inference gap is now explicitly documented in REGISTRY.md, so under this review rubric it is no longer a blocker.
  • Static review only: I did not run the Python test suite in this read-only sandbox.

Methodology

Affected methods: ChaisemartinDHaultfoeuille Phase 2 DID_l, DID^{pl}_l, DID^n_l, delta, and sup-t bands.

  • Severity: P1. Impact: fit(..., L_max >= 2) overwrites overall_att with delta, but the new delta-inference block is guarded by if bootstrap_results is not None, so the default analytical path (n_bootstrap=0) still returns NaN for overall_se, overall_t_stat, overall_p_value, and overall_conf_int. That leaves the PR’s new default top-level estimand without inference unless bootstrap is turned on, even though the registry now says delta SE should be derived from per-horizon SEs. Concrete fix: compute delta_se from multi_horizon_se whenever cost_benefit_result is present, then optionally replace horizon SEs with bootstrap SEs when bootstrap is enabled. Location: diff_diff/chaisemartin_dhaultfoeuille.py#L1502, docs/methodology/REGISTRY.md#L540.

  • Severity: P1. Impact: Equation 3/5 in the dynamic paper weights both the control mean and the horizon aggregate by outcome-period N_{g,t} mass, but the new code uses an unweighted control .mean(), unweighted switcher counts via eligible.sum(), and carries the same equal-group logic into the Phase 2 IF/placebo/normalized/delta paths. On microdata or any panel with uneven (group, time) cell sizes, the shipped Phase 2 outputs estimate a different estimand than the cited method, and there is no explicit Phase 2 Note/Deviation label carrying forward the Phase 1 equal-cell contract. Concrete fix: either implement N_mat[:, out_idx] weights end-to-end in _compute_multi_horizon_dids(), _compute_per_group_if_multi_horizon(), _compute_multi_horizon_placebos(), _compute_normalized_effects(), and _compute_cost_benefit_delta(), or add an explicit REGISTRY Note/Deviation label that equal-cell weighting is intentional for all Phase 2 estimands and pin it with an unequal-cell regression test. Location: diff_diff/chaisemartin_dhaultfoeuille.py#L2221, diff_diff/chaisemartin_dhaultfoeuille.py#L2331, diff_diff/chaisemartin_dhaultfoeuille.py#L2426, docs/methodology/REGISTRY.md#L493, docs/methodology/REGISTRY.md#L518.

  • Severity: P1. Impact: the new sup-t band code bootstraps each horizon independently by calling _bootstrap_one_target() in a loop, and _bootstrap_one_target() generates a fresh weight matrix on every call. That destroys replicate alignment across horizons, so sup_t_dist[b] is built from unrelated draws rather than one joint bootstrap replicate of the event-study vector. The resulting cband_conf_int is therefore not a valid multiplier-bootstrap simultaneous band. Concrete fix: generate one shared (n_bootstrap, n_groups) multiplier matrix for the full event-study IF matrix and compute all horizon perturbations from that same draw, mirroring the existing shared-weight pattern already used in staggered_bootstrap.py. Location: diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L255, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L369, diff_diff/staggered_bootstrap.py#L350, diff_diff/staggered_bootstrap.py#L498.

  • Severity: P3. Impact: re-review note only: the dynamic-placebo inference gap is now explicitly documented as a Phase 2 limitation, so I am not counting it as a blocker under the stated review rules. Concrete fix: none required for approval in this PR unless formal placebo inference is in scope. Location: docs/methodology/REGISTRY.md#L542, diff_diff/chaisemartin_dhaultfoeuille.py#L1537.
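The equal-cell versus N_{g,t} weighting contrast in the second finding can be shown with a two-control toy (all numbers and names invented):

```python
def control_mean(changes, cell_sizes=None):
    """Sketch: mean control-group change. Unweighted (equal-cell) unless
    outcome-period cell sizes N_{g,t} are supplied, in which case each
    group's change is weighted by its cell mass."""
    if cell_sizes is None:
        return sum(changes) / len(changes)
    return sum(c * n for c, n in zip(changes, cell_sizes)) / sum(cell_sizes)

changes = [0.1, 0.4]   # outcome changes for two control groups
sizes = [90, 10]       # N_{g,t} observation counts in the outcome period

print(control_mean(changes))         # equal-cell contract: 0.25
print(control_mean(changes, sizes))  # N_{g,t}-weighted: 0.13
```

With balanced cells the two coincide, which is why only an unequal-cell regression test can pin the intended contract.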

Code Quality

  • No additional findings beyond the methodology-contract issues above.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • No existing entry under TODO.md#L51 covers the Phase 2 delta-inference, weighting, or sup-t issues above, so none of them are mitigated by tracking.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Compute analytical delta inference on the default n_bootstrap=0 path from multi_horizon_se, and add a regression test that overall_se is finite for a nondegenerate L_max > 1 fit.
  2. Decide the Phase 2 weighting contract and implement it consistently: either switch to the paper’s N_{g,t} weighting throughout the DID_l / placebo / normalized / delta / IF pipeline, or add an explicit REGISTRY Note/Deviation label that equal-cell weighting is intentionally carried forward to all Phase 2 estimands and add an unequal-cell regression test.
  3. Rework sup-t calibration to use one shared multiplier draw across all horizons in each bootstrap replicate, then add a regression test that the dCDH event-study bootstrap uses the same shared-weight pattern as staggered_bootstrap.py.
  4. Extend parity and unit coverage to assert horizon-level SE/CI, placebo outputs, and the overall_* contract for L_max > 1.
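The shared-draw requirement in item 3 can be sketched without the library: each bootstrap replicate draws one multiplier weight vector and applies it to the influence functions of every horizon, so the max-|t| statistic is taken over a joint draw. Everything below is a toy layout; the actual helpers (_bootstrap_one_target() and friends) are not reproduced.

```python
import random

random.seed(0)
n_groups, n_horizons, n_boot = 50, 3, 500

# Toy recentered per-group influence functions, one row per horizon.
IF = [[random.gauss(0, 1) for _ in range(n_groups)]
      for _ in range(n_horizons)]
# Analytical SE implied by each horizon's IF.
se = [(sum(x * x for x in row)) ** 0.5 / n_groups for row in IF]

sup_t = []
for _ in range(n_boot):
    w = [random.gauss(0, 1) for _ in range(n_groups)]  # ONE shared draw
    t_stats = []
    for l in range(n_horizons):
        pert = sum(wi * xi for wi, xi in zip(w, IF[l])) / n_groups
        t_stats.append(abs(pert) / se[l])
    sup_t.append(max(t_stats))  # max over horizons, same replicate

sup_t.sort()
crit = sup_t[int(0.95 * n_boot) - 1]  # ~95th-percentile sup-t value
print(crit)
```

Drawing fresh weights per horizon, as the finding describes, breaks the cross-horizon alignment inside max(), which is exactly what invalidates the band.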

…l Note

- Compute delta-method SE regardless of bootstrap (was gated on
  bootstrap_results != None, leaving analytical path with NaN)
- Generate one shared bootstrap weight matrix for all horizons so
  sup-t bands are a valid joint multiplier-bootstrap band
- Add REGISTRY Note for Phase 2 equal-cell weighting deviation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 41bc1e8c3ca5a03b8bc1965782a782d1d17b72bf


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the earlier P1s on equal-cell weighting, default analytical delta inference, and sup-t shared bootstrap draws look resolved or explicitly documented in the current diff.
  • P1 [Newly identified]: the new L_max path does not preserve the library’s documented terminal-missingness contract. A switcher/placebo can remain in N_l / N_pl_l after all observed baseline-matched controls disappear at that horizon, and the code zero-retains the group instead of excluding it.
  • That denominator issue affects DID_l, DID^{pl}_l, and downstream DID^n_l, delta, and per-horizon SEs because they reuse the same Phase 2 eligibility/count logic.
  • P2: the new R fixtures already contain horizon-level SE/CI and placebo outputs, but the parity suite only asserts point estimates and there is no Phase 2 ragged-panel regression test.
  • P3: public docs/docstrings still mix shipped Phase 2 behavior with stale “Phase 2 will add” / placeholder wording and outdated top-level bootstrap-inference semantics once overall_att becomes delta.
  • Static review only: I could not run the Python suite in this read-only sandbox with missing runtime deps.

Methodology

Affected methods: multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized DID^n_l, and cost-benefit delta. These are all part of the dynamic companion paper’s event-study/cost-benefit construction. (nber.org)

Code Quality

  • No additional findings beyond the methodology issue above.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • No additional findings. The active P1 above is not covered by TODO.md:L51-L77, so it remains unmitigated under the stated review rules.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Fix Phase 2 horizon eligibility so missingness-driven empty control pools are excluded from N_l / N_pl_l instead of zero-retained, and apply the same rule in _compute_per_group_if_multi_horizon().
  2. Add a regression test for L_max >= 2 on a terminally missing panel that asserts the affected switcher/placebo is excluded rather than zero-retained, and extend parity coverage to horizon-level SE/CI/placebo outputs.
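The exclusion rule in item 1 amounts to shrinking the denominator N_l instead of averaging in a zero. A minimal sketch, with invented names and data:

```python
def horizon_average(did_by_group):
    """Sketch: DID_l as the mean over groups with a non-empty control
    pool; None marks an empty pool and is dropped from N_l entirely."""
    kept = [v for v in did_by_group.values() if v is not None]
    if not kept:
        return float("nan"), 0
    return sum(kept) / len(kept), len(kept)  # (DID_l, N_l)

# Group "c" lost all baseline-matched controls at this horizon.
effects = {"a": 1.0, "b": 0.6, "c": None}

did_l, n_l = horizon_average(effects)  # exclusion path
zero_retained = sum(v if v is not None else 0.0
                    for v in effects.values()) / len(effects)
print(did_l, n_l)       # 0.8 over N_l = 2
print(zero_retained)    # zero-retention shrinks the estimate toward zero
```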

…taining

When a switcher's control pool is empty at a given horizon (e.g.,
due to terminal missingness), exclude the group from N_l / N_pl_l
rather than zero-retaining it. Zero-retention biases DID_l toward
zero on ragged panels. Reserve zero-contribution for the IF path
only (where it naturally has no effect). Same fix applied to
_compute_multi_horizon_placebos.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a7f32990c6b2b9f8d5e3e1e630e0e08866775323


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The earlier ragged-panel / empty-control-pool finding is only partially fixed: DID_l now excludes those (g,l) pairs, but the new IF/SE/bootstrap path still treats them as zero contributions, so inference remains inconsistent on terminal-missing panels.
  • The Phase 2 surface now explicitly allows event_study_effects[1] to differ from Phase 1 DID_M when L_max >= 2, while the registry and README still state DID_M = DID_1.
  • When L_max > 1 and n_bootstrap > 0, top-level overall_* inference for the new default overall estimand delta is recomputed with safe_inference(), even though the registry/results docs and summary still promise percentile-bootstrap top-level inference.
  • The new R fixtures already contain SE/CI/placebo outputs, but the Phase 2 parity tests still pin only point estimates.
  • Static review only; I did not execute the suite in this read-only sandbox.

Methodology

Affected methods: multi-horizon DID_l, the per-group DID_{g,l} IF/SE path, DID_1, and top-level delta inference.

  • Severity: P1. Impact: _compute_multi_horizon_dids() now excludes switchers whose observed baseline-matched control pool is empty, but _compute_per_group_if_multi_horizon() still leaves those same cases in the IF as zero contributions, and both the analytical SE path and mh_boot_inputs recenter over that larger zero-retained sample while dividing by the reduced N_l. On terminal-missing panels, the point estimate and SE/CI/sup-t/delta SE are therefore computed from different eligibility rules. Concrete fix: make the IF/bootstrap path consume the finalized Phase 2 eligible_mask (or recompute the same empty-control exclusion there), and add a regression with a terminally missing control that asserts both DID_l and its inference exclude the same switcher. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L2242, diff_diff/chaisemartin_dhaultfoeuille.py:L2374, diff_diff/chaisemartin_dhaultfoeuille.py:L1096, diff_diff/chaisemartin_dhaultfoeuille.py:L1358.
  • Severity: P1. Impact: the Phase 2 API intentionally switches event_study_effects[1] to a different computation path whenever L_max >= 2, and the new test suite explicitly accepts that it “may differ” from Phase 1 DID_M. That conflicts with the documented/source-material identity DID_M = DID_1 and makes the l=1 estimate depend on whether the user asks for additional horizons. Concrete fix: either make event_study_effects[1] reuse the Phase 1 DID_M estimand when L_max is set, or add an explicit REGISTRY.md Note/Deviation and distinct user-facing labeling if the project intends Phase 2 DID_1 to be a different estimand. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1439, tests/test_chaisemartin_dhaultfoeuille.py:L1839, docs/methodology/REGISTRY.md:L466, README.md:L1160.
  • Severity: P1. Impact: after bootstrap runs, Phase 2 overwrites the top-level overall estimand with delta and recomputes overall_p_value / overall_conf_int via safe_inference(delta, delta_se). The registry, results docstring, and summary still tell users that top-level overall_* fields are percentile-bootstrap when n_bootstrap > 0, so the new default overall estimand silently breaks the documented inference contract. Concrete fix: use the shared horizon bootstrap draws to form delta_b = Σ_l w_l DID_{l,b} and populate top-level overall_* from _compute_effect_bootstrap_stats; if the project intentionally wants normal-theory delta inference, document that exception everywhere and stop printing the percentile-bootstrap note for delta. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1496, docs/methodology/REGISTRY.md:L542, docs/methodology/REGISTRY.md:L590, diff_diff/chaisemartin_dhaultfoeuille_results.py:L150, diff_diff/chaisemartin_dhaultfoeuille_results.py:L543.

Code Quality

  • No additional findings beyond the methodology issues above.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • No separate findings. I did not find matching TODO.md entries that would mitigate the active P1 issues above.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new golden fixtures already include horizon-level SE/CI/placebo outputs, but the Phase 2 parity suite still asserts only point estimates. That leaves the new inference surfaces unpinned and would not catch the delta bootstrap-contract regression above. Concrete fix: extend _check_multi_horizon() to assert overall_se, CI bounds, n_switchers, and placebo entries from the golden file, and add a dedicated ragged-panel L_max >= 2 regression for empty observed control pools. Location: tests/test_chaisemartin_dhaultfoeuille_parity.py:L239, benchmarks/R/generate_dcdh_dynr_test_values.R:L294.
  • Severity: P3. Impact: some user-facing docs still describe shipped Phase 2 behavior as future work and still say placebo bootstrap “will add” in Phase 2, which no longer matches the code/registry. Concrete fix: reconcile the README and dCDH results docstrings with the actual L_max/delta/dynamic-placebo surface. Location: README.md:L1160, README.md:L1255.

Path to Approval

  1. Make the Phase 2 IF/bootstrap path use the same post-exclusion horizon eligibility that DID_l uses, then add a terminal-missingness regression that checks both the estimate and its inference.
  2. Resolve the DID_M versus DID_1 contract: either keep l=1 invariant to L_max or explicitly document and relabel the alternate Phase 2 l=1 estimand in REGISTRY.md and user-facing docs.
  3. Generate a bootstrap distribution for delta from the shared horizon draws and populate top-level overall_* from it; otherwise, document the analytical fallback everywhere and stop advertising percentile-bootstrap top-level inference for L_max > 1.

…delta contracts

- IF/SE/bootstrap paths now use the same combined eligibility mask as
  _compute_multi_horizon_dids (singleton-baseline + empty-control-pool
  exclusions), so point estimate and inference agree on terminal-missing
  panels
- Add REGISTRY Note documenting that event_study_effects[1] uses per-group
  DID_{g,1} (cohort-based controls) when L_max >= 2, which may differ
  from Phase 1 DID_M (period-based controls) on mixed-direction panels
- Add REGISTRY Note documenting that delta SE uses delta-method
  (normal-theory) even when bootstrap is enabled, as an intentional
  exception to the bootstrap-inference-surface contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4d71c59d7320d3a21343cb72aa6455a0fbe1fa98


Overall Assessment

⛔ Blocker

Executive Summary

  • Affected methods: multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized DID^n_l, cost-benefit delta, and the new multiplier-bootstrap / sup-t inference surface.
  • The prior DID_1 vs DID_M identity issue and the top-level delta inference-contract issue are now explicitly documented in docs/methodology/REGISTRY.md, so they are no longer blocking defects under the registry rules.
  • The new Phase 2 variance path still drops control-role groups out of the horizon IF before recentering and bootstrapping, so the new SEs/CIs/p-values/sup-t bands are not based on the full group-level IF.
  • The new delta inference path silently skips weighted horizons with NaN SE instead of propagating NaN to all top-level inference fields, which can yield misleading finite p-values/CIs.
  • The new golden fixtures include horizon SE/CI/placebo outputs, but the parity suite still only pins point estimates, so it would not catch the new inference bugs.
  • Static review only; I did not execute the suite here because the sandbox Python environment is missing numpy.

Methodology

  • Severity: P0. Impact: overall_att becomes the cost-benefit delta, but its SE path silently drops any positively weighted horizon whose SE is non-finite (if np.isfinite(se_l): var_delta += ...). That means overall_p_value / overall_conf_int can remain finite even when part of the aggregate has undefined inference, violating the project’s NaN-consistent inference contract for statistical output. Concrete fix: before computing delta_se, require every horizon with w_l > 0 to have finite SE; otherwise set overall_se, overall_t_stat, overall_p_value, and overall_conf_int to NaN. If you want non-NaN inference here, bootstrap delta directly from aligned horizon draws instead of dropping horizons term-by-term. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1510-L1539, docs/methodology/REGISTRY.md:L576-L580, diff_diff/utils.py:L152-L186.

  • Severity: P1. Impact: _compute_per_group_if_multi_horizon() explicitly builds a full per-group IF with both switcher and control-role contributions, but the analytical SE path and the bootstrap path immediately intersect that IF with did_eligible, i.e. the switcher-only mask. That drops never-switchers and later-switching controls even when they carry non-zero IF mass, so the new analytical SEs, bootstrap SEs/CIs/p-values, sup-t bands, normalized-effect SEs, and delta SE are all based on an incomplete IF and will be understated. This is not aligned with the registry’s group-level IF contract, which says stable controls can have non-zero U^G_g. Concrete fix: keep the Phase 2 U_l vectors at full variance-eligible group length (apply only the singleton-baseline variance filter), recenter on that full group set, and feed those full-length vectors into the shared-weight bootstrap so group identities stay aligned across horizons. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1096-L1105, diff_diff/chaisemartin_dhaultfoeuille.py:L1364-L1393, diff_diff/chaisemartin_dhaultfoeuille.py:L2300-L2398, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L254-L309, docs/methodology/REGISTRY.md:L564-L570, docs/methodology/REGISTRY.md:L590-L592.
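The proposed fix — recenter over the full variance-eligible group set rather than a switcher-only mask — can be sketched as follows. Names and the exact variance normalization are assumptions for illustration; the real path feeds these vectors into the shared-weight bootstrap as well.

```python
import numpy as np

def horizon_analytical_se(U_l, variance_eligible):
    """Illustrative sketch: recenter the per-group influence function
    over the FULL variance-eligible group set -- switchers AND
    control-role groups -- instead of intersecting with a
    switcher-only mask that zeroes out control contributions."""
    U = np.asarray(U_l, dtype=float)[np.asarray(variance_eligible, dtype=bool)]
    G = U.size
    U_centered = U - U.mean()  # recenter on the full group set
    # variance of the mean of G group-level IF terms
    return float(np.sqrt(np.sum(U_centered ** 2) / (G * (G - 1))))
```

Under this sketch, a control-only group carrying non-zero IF mass changes the horizon SE — which is exactly the regression the Path to Approval below asks for.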

  • Severity: P3. Impact: The prior re-review blockers around event_study_effects[1] versus Phase 1 DID_M, and around normal-theory delta inference under n_bootstrap > 0, are now explicitly documented in the registry with Note labels, so they no longer count as methodology defects under the stated review policy. Concrete fix: None required for approval; optional user-doc sync remains below. Location: docs/methodology/REGISTRY.md:L536-L546.

Code Quality

  • No additional findings beyond the methodology issues above.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • Severity: P3. Impact: TODO.md does not currently track the new Phase 2 parity/docs follow-up work, so if that cleanup is deferred it is not mitigated by the project’s deferred-work policy. The active P0/P1 inference bugs are not TODO-mitigable in any case. Concrete fix: if you defer the P2 test/doc cleanup, add an entry under TODO.md → “Tech Debt from Code Reviews.” Location: TODO.md:L51-L77.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new golden-value generator now records per-horizon SE, CI, placebo, and switcher-count outputs, but the multi-horizon parity suite still asserts only point estimates, and the sup-t tests only check that simultaneous bands are wider than pointwise CIs on a synthetic panel. That would not catch either inference bug above. Concrete fix: extend _check_multi_horizon() to assert horizon overall_se, CI bounds, n_switchers, and placebo outputs from the golden file, and add one regression where a control-only group has non-zero U_l outside the switcher set plus one regression where a positive-weight horizon has NaN SE and delta inference must be all-NaN. Location: benchmarks/R/generate_dcdh_dynr_test_values.R:L294-L325, tests/test_chaisemartin_dhaultfoeuille_parity.py:L239-L258, tests/test_chaisemartin_dhaultfoeuille.py:L2069-L2101.
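An extended parity check along the lines suggested could look like the sketch below. The record and fixture field names here are assumptions mirroring the finding's wording, not the actual shape of `_check_multi_horizon()` or the golden file.

```python
import numpy as np

def check_multi_horizon(result, golden, atol=1e-6):
    """Hypothetical extension of _check_multi_horizon(): pin the
    inference surface (SE, CI bounds, switcher counts, placebos) from
    the golden fixture, not just point estimates."""
    for l, g in golden.items():
        eff = result[l]  # per-horizon effect record (illustrative shape)
        assert np.isclose(eff["att"], g["att"], atol=atol)
        assert np.isclose(eff["se"], g["se"], atol=atol)
        assert np.isclose(eff["ci_lo"], g["ci_lo"], atol=atol)
        assert np.isclose(eff["ci_hi"], g["ci_hi"], atol=atol)
        assert eff["n_switchers"] == g["n_switchers"]
        if "placebo_att" in g:
            assert np.isclose(eff["placebo_att"], g["placebo_att"], atol=atol)
```

The point is that a regression in any inference field now trips the parity guard, not only a drift in the point estimate.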

  • Severity: P3. Impact: several user-facing docs still describe Phase 2 behavior as future work or say top-level overall_* inference is always percentile-bootstrap when n_bootstrap > 0, which is no longer true once overall_att is repurposed to delta; the placebo notes also still say Phase 2 will add bootstrap placebo inference even though the registry now documents that as deferred follow-up. Concrete fix: align README / API docs / results docstring / summary note with the registry’s Phase 2 notes. Location: README.md:L1160-L1160, README.md:L1208-L1255, diff_diff/chaisemartin_dhaultfoeuille_results.py:L150-L166, diff_diff/chaisemartin_dhaultfoeuille_results.py:L543-L548, docs/api/chaisemartin_dhaultfoeuille.rst:L199-L214, docs/choosing_estimator.rst:L289-L295, docs/llms-full.txt:L321-L323.

Path to Approval

  1. Rebuild the Phase 2 analytical SE and bootstrap inputs on the full group-level horizon IF, not the switcher-only did_eligible subset. A minimal regression should assert that a control-only group can carry non-zero U_l and changes the horizon SE.
  2. Make delta inference NaN-consistent: if any positively weighted horizon has non-finite SE, set all top-level delta inference fields to NaN, or bootstrap delta directly from aligned horizon draws. Add a regression with one weighted horizon degenerate and assert overall_se, overall_t_stat, overall_p_value, and overall_conf_int all become NaN.
  3. Expand the multi-horizon parity coverage to pin SE/CI/placebo/switcher-count outputs from the existing golden fixtures so the inference surface is actually protected in CI.

- Delta SE now requires ALL positively-weighted horizons to have finite
  SE; if any has NaN, overall_se/p/CI are all NaN (NaN-consistent
  inference contract)
- IF/SE/bootstrap paths use full variance-eligible group set (singleton-
  baseline filter only), not the switcher-only did_eligible mask. Never-
  switchers and later-switching controls with non-zero IF mass from
  their control roles are now included, matching the Phase 1 IF contract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8cac82d46f72e85861d031788783fb2b538ad382


Overall Assessment

✅ Looks good

Executive Summary

  • Affected methods: multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized DID^n_l, cost-benefit delta, and the multi-horizon variance/bootstrap surface from the dynamic companion paper. (nber.org)
  • The prior blocker-level issues appear resolved: the multi-horizon analytical SE/bootstrap now keep the full variance-eligible group-level IF, and delta inference now propagates NaN consistently when a positively weighted horizon lacks a finite SE.
  • The remaining substantive gap is non-blocking: the new multi-horizon parity fixtures include SE/CI/placebo/switcher-count outputs, but the parity test still asserts only point estimates.
  • One user-facing warning is misleading: it says groups for affected horizons are zeroed but retained in N_l / N^{pl}_l, while the helper actually drops those groups from the denominator.
  • Static diff review only; I did not execute the test suite here.

Methodology

No unmitigated P0/P1 findings. The previously reported IF-scope and delta NaN-consistency blockers appear fixed in diff_diff/chaisemartin_dhaultfoeuille.py:L1096-L1105, diff_diff/chaisemartin_dhaultfoeuille.py:L1364-L1392, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L254-L309, and diff_diff/chaisemartin_dhaultfoeuille.py:L1521-L1548. The remaining deviations I reviewed are explicitly documented in docs/methodology/REGISTRY.md:L520-L546, so they are non-blocking under the stated policy. (nber.org)

Code Quality

No findings.
Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity: P3
    Impact: The remaining test/doc cleanup below is deferrable, but it is not currently tracked under TODO.md's "Tech Debt from Code Reviews" section.
    Concrete fix: If the parity/docs cleanup is deferred, add one entry under TODO.md:L51-L97 referencing this PR and the specific deferred work.

Security

No findings.

Documentation/Tests

  • Severity: P2
    Impact: The new R fixture generator now records per-horizon SEs, CI bounds, placebo outputs, and switcher counts, but the new multi-horizon parity test still asserts only point estimates. A regression in horizon inference, placebo outputs, or n_switchers would bypass the main parity guard.
    Concrete fix: Extend _check_multi_horizon() to assert overall_se, overall_ci_lo, overall_ci_hi, n_switchers, and placebo outputs from the existing golden file, and add a dedicated regression where a positively weighted horizon has NaN SE so overall_se, overall_t_stat, overall_p_value, and overall_conf_int must all be NaN. Refs: benchmarks/R/generate_dcdh_dynr_test_values.R:L294-L325, tests/test_chaisemartin_dhaultfoeuille_parity.py:L221-L259.

  • Severity: P3
    Impact: README still mixes shipped-Phase-2 wording with older "Phase 2 will add DID_l" text and still describes overall_att as always DID_M, which conflicts with the new L_max > 1 behavior where overall_att becomes delta.
    Concrete fix: Align the README overview/table with the current registry and results surface. Refs: README.md:L1160-L1160, README.md:L1208-L1210, README.md:L1216-L1239, docs/methodology/REGISTRY.md:L520-L546.

- Fix multi-horizon control-availability warnings to say "excluded
  from N_l" (not "zeroed and retained")
- Update README overall_att description to reflect delta when L_max > 1
- Update README placebo Note to reflect current state (not "Phase 2
  will add")
- Update ROADMAP 2c and 2h status to reflect placebo SE and parity
  SE/placebo deferrals

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 39f3978296dbe18608febbd28c52a2a4b4c66afa


Overall Assessment

⛔ Blocker

The re-review found one unmitigated P0 in the new Phase 2 dynamic placebo path and one unmitigated P1 in the new Phase 2 top-level delta surface. Static diff review only; I did not execute the test suite in this environment.

Executive Summary

  • Affected methods reviewed: multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized DID^n_l, cost-benefit delta, and the Phase 2 inference/result-propagation surface.
  • P0: the new placebo helper can include terminally missing controls that are not observed at the forward eligibility horizon, which changes control-group composition for DID^{pl}_l with no warning.
  • P1: when Phase 2 delta is non-estimable, the top-level overall_* fields can silently remain the old Phase 1 DID_M values even though the results surface labels them as delta.
  • The prior warning-text inconsistency around exclusion from N_l / N^{pl}_l appears fixed.
  • The prior non-blocking parity gap remains: the new golden fixtures include SE/CI/placebo/n_switchers, but the multi-horizon parity test still checks only point estimates.

Methodology

Affected methods: DID_l, DID^{pl}_l, DID^n_l, and delta.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity: P3. Impact: the remaining non-blocking parity/docs follow-ups are not tracked under TODO.md's “Tech Debt from Code Reviews” section, so if they are intentionally deferred they have no project-level breadcrumb. Concrete fix: if those items are not fixed in this PR, add a TODO.md entry referencing this PR and the specific deferred work. Refs: TODO.md:L51-L98

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new R fixture generator now stores per-horizon SEs, CI bounds, placebo outputs, and n_switchers, but _check_multi_horizon() still asserts only point estimates. A regression in Phase 2 inference, placebo outputs, or switcher counts would bypass the main parity guard. Concrete fix: extend _check_multi_horizon() to assert at least per-horizon SE, CI bounds, n_switchers, and placebo point estimates from the existing golden file. Refs: benchmarks/R/generate_dcdh_dynr_test_values.R:L294-L325, tests/test_chaisemartin_dhaultfoeuille_parity.py:L239-L258

  • Severity: P3. Impact: several changed docs still contain stale or internally inconsistent Phase 2 language. Examples: README still says “Phase 2 will add” multi-horizon output, the registry checklist still says L_max is behind a NotImplementedError gate, and other docs still say placebo bootstrap will arrive “in Phase 2” even though the new registry note says placebo inference remains deferred. Concrete fix: align README/REGISTRY/chooser docs with the shipped Phase 2 contract and the still-deferred placebo inference work. Refs: README.md:L1160, docs/choosing_estimator.rst:L289-L295, docs/methodology/REGISTRY.md:L586-L586, docs/methodology/REGISTRY.md:L614-L617

Path to Approval

  1. In _compute_multi_horizon_placebos(), exclude controls that are not observed at forward_idx, and make sure T_g cannot extend placebo support via terminally missing groups.
  2. Add a regression test where a terminally missing baseline-matched control would currently enter the placebo pool, and assert it is excluded.
  3. When L_max >= 2 and cost_benefit_delta["delta"] is non-finite, set all top-level overall_* fields to NaN with a warning instead of leaving the Phase 1 DID_M values in place.
  4. Add a regression test for a Phase 2 run with non-estimable delta and assert overall_att, overall_se, overall_t_stat, overall_p_value, and overall_conf_int are NaN-consistent.
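Item 1 above — the dual eligibility condition for placebo controls — can be sketched as a boolean mask. This is illustrative only: `N_mat` is assumed to hold group-by-period cell counts, and the argument names are not the library's actual signature.

```python
import numpy as np

def placebo_control_mask(N_mat, is_switcher_l, base_idx, forward_idx):
    """Hypothetical sketch of the dual eligibility condition for
    dynamic-placebo controls: a control-role group must be observed at
    the baseline period AND at the forward horizon, so a terminally
    missing group cannot leak into DID^{pl}_l."""
    observed_base = N_mat[:, base_idx] > 0
    observed_forward = N_mat[:, forward_idx] > 0  # the added condition
    return (~np.asarray(is_switcher_l, dtype=bool)) & observed_base & observed_forward
```

A terminally missing group (observed at baseline, unobserved at the forward horizon) is excluded under this mask, which is the regression asked for in item 2.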

…lback

- Add N_mat[ctrl, forward_idx] > 0 to placebo control mask so
  terminally missing controls don't leak into DID^{pl}_l
- When delta is NaN (non-estimable), set all overall_* to NaN
  instead of silently falling back to the Phase 1 DID_M values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a3b01bc706eaead4d89640cd8b06523341a9929d


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • Severity: P2. Impact: the Phase 2 summary note is now semantically wrong for overall_* when L_max >= 2 and bootstrap is enabled. The code prints “multiplier-bootstrap percentile inference” whenever bootstrap_results exists and overall_se is finite, but the new Phase 2 contract explicitly computes delta inference from safe_inference(delta, delta_se) instead. That can mislead users about what interval/p-value they are reading. Concrete fix: branch the summary note on self.L_max >= 2 and self.cost_benefit_delta is not None, and describe the delta path as delta-method/normal-theory even when per-horizon SEs came from bootstrap (diff_diff/chaisemartin_dhaultfoeuille_results.py:L543-L552, docs/methodology/REGISTRY.md:L544-L546, docs/methodology/REGISTRY.md:L592-L592).
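The proposed branching of the summary note can be sketched as a small helper. Attribute and argument names are assumptions for illustration; the real code lives on the results object.

```python
import numpy as np

def overall_inference_note(L_max, cost_benefit_delta, bootstrap_results, overall_se):
    """Hypothetical sketch of the summary() branching: when L_max >= 2
    the overall_* fields describe delta, whose inference is
    delta-method/normal-theory even if the per-horizon SEs came from
    the multiplier bootstrap."""
    if L_max is not None and L_max >= 2 and cost_benefit_delta is not None:
        return ("Overall (delta): delta-method normal-theory SE/CI "
                "(per-horizon SEs may be bootstrap-based).")
    if bootstrap_results is not None and np.isfinite(overall_se):
        return "Overall: multiplier-bootstrap percentile inference."
    return "Overall: analytical (normal-theory) inference."
```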

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: the remaining deferred parity/docs follow-ups are still not tracked in TODO.md’s “Tech Debt from Code Reviews” section, so there is no project breadcrumb if they stay deferred. Concrete fix: add a TODO.md entry tying this PR to the still-deferred parity expansion and doc cleanup (TODO.md:L51-L97).

Security

  • No findings.

Documentation/Tests

- summary() now describes delta SE as delta-method (normal-theory)
  when L_max >= 2 with bootstrap, instead of claiming percentile
- Update README, choosing_estimator.rst, REGISTRY.md to reflect
  shipped Phase 2 state (was: "Phase 2 will add")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 12, 2026
@igerber igerber merged commit 6e43050 into main Apr 12, 2026
24 of 26 checks passed
@igerber igerber deleted the dcdh-phase2 branch April 12, 2026 20:53