
Add Phase 2 multi-horizon event study for dCDH estimator#294

Merged
igerber merged 11 commits into main from dcdh-phase2
Apr 12, 2026

Conversation

@igerber
Owner

igerber commented Apr 12, 2026

Summary

  • Add multi-horizon event study DID_l for l = 1..L_max to ChaisemartinDHaultfoeuille via the L_max parameter (ROADMAP items 2a-2h)
  • Per-group DID_{g,l} building block (Eq 3 of dynamic companion paper, NBER WP 29873)
  • Per-horizon cohort-recentered analytical SE (Web Appendix Section 3.7.3)
  • Dynamic placebos DID^{pl}_l with dual eligibility condition (Section 1.1)
  • Normalized estimator DID^n_l = DID_l / delta^D_l (Section 3.2)
  • Cost-benefit aggregate delta (Section 3.3, Lemma 4) - becomes overall_att when L_max > 1
  • Sup-t simultaneous confidence bands via multiplier bootstrap
  • plot_event_study() integration with <50% switcher warning
  • to_dataframe("event_study") and to_dataframe("normalized") output levels
  • Per-horizon bootstrap with SE/CI/p-value propagation to event_study_effects
  • R DIDmultiplegtDYN parity tests at multiple horizons (4 new scenarios)
  • L_max=None (default) preserves exact Phase 1 behavior (142 tests, 0 failures)
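The per-group building block named in the second bullet can be sketched in plain Python. Everything below (data layout, function name, toy panel) is hypothetical and only mirrors the switcher-minus-controls shape of Eq 3; it is not the library's implementation.

```python
def did_gl(y, first_switch, g, l):
    """Sketch of DID_{g,l}: switcher g's outcome change from its last
    pre-period F_g - 1 to F_g - 1 + l, minus the mean change of groups
    that have not yet switched by the outcome period. Hypothetical API."""
    f = first_switch[g]
    base, out = f - 1, f - 1 + l
    sw_change = y[g][out] - y[g][base]
    # Control pool: groups still untreated at the outcome period.
    ctrl = [y[c][out] - y[c][base]
            for c in y if c != g and first_switch[c] > out]
    if not ctrl:
        return None  # empty control pool: no estimate at this horizon
    return sw_change - sum(ctrl) / len(ctrl)

# Toy panel: periods 0..3; group "a" first switches at t = 2.
y = {"a": [1.0, 1.0, 2.0, 2.5], "b": [1.0, 1.1, 1.2, 1.3]}
first_switch = {"a": 2, "b": 99}  # "b" never switches
print(did_gl(y, first_switch, "a", 1))
print(did_gl(y, first_switch, "a", 2))
```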

Methodology references (required if estimator / math changes)

  • Method name(s): ChaisemartinDHaultfoeuille (DCDH) - multi-horizon DID_l, normalized DID^n_l, cost-benefit delta, dynamic placebos DID^{pl}_l
  • Paper / source link(s): de Chaisemartin, C. & D'Haultfoeuille, X. (2022, revised 2024). Difference-in-Differences Estimators of Intertemporal Treatment Effects. NBER Working Paper 29873. Equations 3, 5, 15; Lemma 4; Web Appendix Section 1.1, Section 3.7.3.
  • Any intentional deviations from the source (and why): Cell-count weighting (Phase 1 deviation, carried forward); period-based vs cohort-based stable controls at l=1 (Phase 1 deviation, carried forward); per-group DID_{g,l} path at l>=2 uses the paper's cohort-based controls naturally

Validation

  • Tests added/updated: tests/test_chaisemartin_dhaultfoeuille.py (+18 Phase 2 tests), tests/test_chaisemartin_dhaultfoeuille_parity.py (+4 multi-horizon R parity tests)
  • R parity: 9 scenarios (5 Phase 1 + 4 Phase 2), all passing against R DIDmultiplegtDYN v2.3.3
  • Golden values regenerated: benchmarks/data/dcdh_dynr_golden_values.json

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…uille

Implements ROADMAP items 2a-2h: multi-horizon DID_l via per-group
DID_{g,l} building block (Eq 3 of dynamic paper), per-horizon
analytical SE, dynamic placebos DID^{pl}_l, normalized DID^n_l,
cost-benefit aggregate delta, sup-t simultaneous confidence bands,
plot_event_study() integration, and R DIDmultiplegtDYN parity tests
at multiple horizons. L_max parameter controls multi-horizon mode;
L_max=None preserves exact Phase 1 behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

⛔ Blocker

Executive Summary

  • Affected methodology: ChaisemartinDHaultfoeuille Phase 2 multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized effects DID^n_l, cost-benefit aggregate delta, and multiplier-bootstrap sup-t bands.
  • The new delta implementation does not match the cumulative-dose formula documented in the Methodology Registry, so overall_att becomes the wrong estimand when L_max > 1.
  • The new dynamic placebo helper reverses the pre-trend difference relative to the existing placebo convention, which likely negates DID^{pl}_l.
  • Sup-t bands are calibrated on horizons 2..L_max but then attached to l=1 as well, so the reported “simultaneous” band is miscalibrated.
  • The Phase 2 public surface is internally inconsistent at l=1: event_study_effects[1], normalized_effects[1], and top-level overall_att/labels can refer to different estimands, and the tests currently do not pin the new methodology-sensitive outputs tightly enough to catch that.

Methodology

Code Quality

Performance

No findings.

Maintainability

No additional findings beyond the public-surface contract issue above.

Tech Debt

No relevant TODO.md entry mitigates the P0/P1 issues above, so they remain unmitigated. Location: TODO.md.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Reimplement delta weights from cumulative delta^D_{g,l} rather than one-period dose gaps, and add a binary-treatment regression test with expected l * N_l weighting.
  2. Flip the dynamic placebo difference to reference-minus-preperiod, then extend the R fixture assertions to cover placebo outputs.
  3. Recompute the sup-t critical value over the exact set of horizons that receive simultaneous bands, or stop attaching a band to l=1.
  4. Make the L_max>=2 public contract internally consistent: event_study_effects[1], normalized_effects[1], bootstrap inference, and user-facing labels must all refer to the same estimand with matching SEs.
  5. Add explicit A11/control-availability warnings for empty multi-horizon control pools, especially under terminal missingness.

…onsistency

- Fix cost-benefit delta to use cumulative dose (sum_{k=0}^{l-1} |D_{g,F_g+k} - D_{g,1}|)
  instead of one-period dose; binary weights now proportional to l * N_l
- Flip dynamic placebo sign to ref-minus-preperiod (Y_{ref} - Y_{backward}),
  matching the Phase 1 convention
- Include l=1 in sup-t bootstrap calibration so bands are truly simultaneous
  over all horizons 1..L_max
- Use per-group DID_{g,l} path for event_study_effects[1] when L_max >= 2,
  making all horizons use a consistent estimand
- Label overall_att as "delta" in summary/to_dataframe when L_max > 1
- Add A11 control-availability warnings for multi-horizon empty control pools

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7ed7fde5784ba56c3fa096eb42db8a0106229cf6


Overall Assessment

⛔ Blocker

Executive Summary

  • Re-review status: the earlier sup-t calibration blocker and the main Phase 2 l=1 estimand-propagation issue appear resolved in the changed code.
  • P0 remains: dynamic placebos are still sign-reversed relative to the cited Phase 2 placebo definition, so the new negative-horizon outputs report the wrong direction.
  • P0 remains: cost-benefit delta still uses the wrong horizon weights, and overall_att is overwritten with that aggregate when L_max > 1.
  • P1 remains: multi-horizon placebos still silently zero-retain empty control pools without the warning contract used elsewhere in dCDH.
  • The remaining non-blocking issue is public-surface labeling: some result helpers still call the top-level estimand DID_M even when Phase 2 repurposes it to hold delta.
  • Static review only: I could not run the Python test suite in this sandbox because the environment is missing numpy.

Methodology

  • Severity: P0. Impact: _compute_multi_horizon_placebos() still computes the placebo change as Y_ref - Y_backward for both switchers and controls, but the dynamic companion paper defines DID^{pl}_{g,l} using the backward-minus-reference change. That silently flips the sign of every reported DID^{pl}_l, and the registry does not document this as a deviation. Concrete fix: reverse both subtraction orders in _compute_multi_horizon_placebos() and extend parity coverage to assert the R Placebos fixture, not just Effects. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L2430-L2450, docs/methodology/REGISTRY.md:L518-L538.

  • Severity: P0. Impact: _compute_cost_benefit_delta() still weights each horizon by cumulative dose up to l, not by Lemma 4’s horizon weight. Under a binary joiners-only design, Lemma 4 implies weights proportional to N_l; the current implementation makes them proportional to l * N_l, so longer horizons are overweighted. Because fit() overwrites overall_att with this delta when L_max > 1, this is a silent main-result error, not just a side-metric issue. The implementation also contradicts its own docstring. Concrete fix: compute w_l from Lemma 4’s per-horizon increment formula and add a binary-treatment regression that distinguishes N_l from l * N_l. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L2526-L2652, diff_diff/chaisemartin_dhaultfoeuille.py:L1483-L1500.

  • Severity: P1. Impact: the new multi-horizon DID path now surfaces control-availability warnings, but the new placebo path still does not. _compute_multi_horizon_placebos() zeroes pl_g_l[g] when ctrl_pool.size == 0 and retains that group in N_pl_l, yet fit() never emits the consolidated warning contract used by _compute_placebo() and documented in the registry’s “no silent failures” / placebo-warning notes. On ragged or terminally missing panels, that can silently bias DID^{pl}_l toward zero. Concrete fix: accumulate placebo control-availability warnings inside _compute_multi_horizon_placebos() and surface them from fit() the same way Phase 1 placebo warnings and the new multi-horizon DID warnings are handled. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L2434-L2446, diff_diff/chaisemartin_dhaultfoeuille.py:L1140-L1152, diff_diff/chaisemartin_dhaultfoeuille.py:L1885-L1983, docs/methodology/REGISTRY.md:L586-L618.

Re-review note: the prior sup-t calibration finding appears fixed. The bootstrap now calibrates over horizons 1..L_max and applies bands to that same event-study set. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1321-L1372, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L254-L301, diff_diff/chaisemartin_dhaultfoeuille.py:L1471-L1481.

Code Quality

Performance

No findings.

Maintainability

No additional findings beyond the result-surface contract issue above.

Tech Debt

No mitigating entry in TODO.md:L51-L97 covers the P0/P1 issues above, so they remain unmitigated under the project’s deferral rules.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Reverse the Phase 2 placebo differences in _compute_multi_horizon_placebos() and add parity assertions over Placebos from benchmarks/data/dcdh_dynr_golden_values.json.
  2. Reimplement _compute_cost_benefit_delta() using Lemma 4’s per-horizon weights and add a regression that distinguishes N_l from l * N_l under binary treatment.
  3. Add consolidated warning plumbing for zero-retained multi-horizon placebo control-pool failures, mirroring Phase 1 placebo warnings and the new multi-horizon DID warning path.
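The N_l versus l * N_l distinction in item 2 is easy to pin with a toy check. The helper below is a hypothetical sketch (all names and numbers invented) of the binary-case contract the review describes, where each horizon's weight is proportional to the number of switchers reaching it:

```python
def cost_benefit_delta(did_l, n_l):
    """Sketch: delta = sum_l w_l * DID_l with w_l proportional to N_l,
    the binary joiners-only reading of Lemma 4 per the review."""
    total = sum(n_l.values())
    return sum((n_l[l] / total) * did_l[l] for l in did_l)

did_l = {1: 0.9, 2: 1.3}   # per-horizon event-study effects (toy values)
n_l = {1: 10, 2: 6}        # switchers observed at each horizon

delta = cost_benefit_delta(did_l, n_l)
print(delta)  # N_l weighting

# The rejected l * N_l weighting overweights longer horizons:
bad_n = {l: l * n for l, n in n_l.items()}   # {1: 10, 2: 12}
print(cost_benefit_delta(did_l, bad_n))
```

A regression test that asserts the first value and not the second would catch exactly the weighting bug flagged above.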

…s, labels

- Flip placebo to paper convention (backward - ref) for R parity
- Revert cost-benefit delta to per-period dose (Lemma 4: w_l proportional
  to N_l for binary, not l * N_l)
- Add placebo control-availability warnings mirroring DID path
- Update __repr__ and joiners_leavers labels for delta when L_max > 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 32f7a5341941d886c1ee77b1363c86510224e0bd


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the prior placebo-sign, Lemma 4 weighting, and multi-horizon placebo-warning findings look addressed in the updated helper code and warning plumbing.
  • P1: when L_max >= 2, the PR repurposes overall_att to the cost-benefit aggregate delta but clears the entire overall_* inference surface to NaN, so the new default top-level estimand ships without SE/CI/p-value.
  • P1: dynamic placebo horizons still ship as point estimates only; negative-horizon inference is hard-coded to NaN, and the new placebo-bootstrap plumbing is never connected.
  • The new R fixtures already capture placebo SE/CI, but the parity suite only asserts positive-horizon point estimates, so the inference gaps are not currently pinned.
  • Static review only: I could not run the Python tests in this sandbox because numpy is unavailable.

Methodology

Code Quality

  • Severity: P2. Impact: summary() still prints the generic “p-value and CI are multiplier-bootstrap percentile inference” note whenever bootstrap_results exists, even in Phase 2 where the displayed overall row is delta with all inference fields set to NaN. That makes the human-readable output claim bootstrap inference for an estimand that currently has none. Concrete fix: gate the note on non-NaN overall_* inference or split the note so it only describes the event-study horizons that actually received bootstrap propagation. Location: diff_diff/chaisemartin_dhaultfoeuille_results.py:L543.

Performance

  • No findings.

Maintainability

  • No additional findings beyond the inference-surface contract issues above.

Tech Debt

  • No mitigating entry under TODO.md:L51 covers the new Phase 2 delta/placebo inference gaps, so they remain unmitigated for this review.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Implement and propagate inference for the Phase 2 cost-benefit delta surfaced in overall_att, or stop overloading overall_* until that inference exists.
  2. Implement dynamic-placebo inference end-to-end and populate placebo_event_study from it, or explicitly defer/remove the public negative-horizon inference surface and document that deferral in REGISTRY.md.
  3. Extend the R parity and unit tests to assert placebo effect/SE/CI and the L_max > 1 overall_* inference contract.
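One analytical route for item 1 is a delta-method SE built from the per-horizon SEs, treating horizons as independent (a simplifying assumption). All names here are illustrative, not the library's API:

```python
import math

def delta_method_se(weights, ses):
    # SE(delta) = sqrt(sum_l w_l^2 * SE(DID_l)^2), assuming the
    # per-horizon estimates are uncorrelated (simplifying assumption).
    return math.sqrt(sum(w * w * s * s for w, s in zip(weights, ses)))

w = [0.625, 0.375]    # horizon weights, summing to 1 (toy values)
se = [0.20, 0.40]     # per-horizon SEs of DID_l
print(delta_method_se(w, se))
```

In practice the horizon estimates share groups and are correlated, so this understates uncertainty unless the cross-horizon covariances are added or a joint bootstrap is used instead.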

…mary gate

- Compute cost-benefit delta SE from per-horizon SEs via delta method:
  SE(delta) = sqrt(sum w_l^2 * SE(DID_l)^2), giving overall_att
  non-NaN inference when L_max > 1
- Document placebo SE NaN as intentional Phase 2 deferral in REGISTRY.md
  (placebo IF computation deferred; point estimates meaningful for
  visual pre-trends; bootstrap plumbing exists but not wired)
- Gate summary() bootstrap note on non-NaN overall inference
- Remove unused import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 00efe07e9bd145776ebaa1d965d9fa6d05a9445c


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the earlier placebo-sign, Lemma 4 weighting-direction, and far-horizon warning issues appear addressed in the current diff.
  • P1: overall_att now becomes Phase 2 delta, but the new SE/CI/p-value path only runs when bootstrap results exist, so the default analytical path still leaves the new top-level estimand with NaN inference.
  • P1: the new Phase 2 DID_l / DID^{pl}_l / DID^n_l / delta code averages controls and switchers equally instead of using the paper’s outcome-period N_{g,t} weights, so unequal cell sizes silently change the estimand with no explicit Phase 2 Note/Deviation label.
  • P1: the sup-t bands are built from independently bootstrapped horizons rather than one shared bootstrap replicate across the full event-study vector, so the simultaneous bands are not a valid joint multiplier-bootstrap band.
  • P3: the dynamic-placebo inference gap is now explicitly documented in REGISTRY.md, so under this review rubric it is no longer a blocker.
  • Static review only: I did not run the Python test suite in this read-only sandbox.

Methodology

Affected methods: ChaisemartinDHaultfoeuille Phase 2 DID_l, DID^{pl}_l, DID^n_l, delta, and sup-t bands.

  • Severity: P1. Impact: fit(..., L_max >= 2) overwrites overall_att with delta, but the new delta-inference block is guarded by if bootstrap_results is not None, so the default analytical path (n_bootstrap=0) still returns NaN for overall_se, overall_t_stat, overall_p_value, and overall_conf_int. That leaves the PR’s new default top-level estimand without inference unless bootstrap is turned on, even though the registry now says delta SE should be derived from per-horizon SEs. Concrete fix: compute delta_se from multi_horizon_se whenever cost_benefit_result is present, then optionally replace horizon SEs with bootstrap SEs when bootstrap is enabled. Location: diff_diff/chaisemartin_dhaultfoeuille.py#L1502, docs/methodology/REGISTRY.md#L540.

  • Severity: P1. Impact: Equation 3/5 in the dynamic paper weights both the control mean and the horizon aggregate by outcome-period N_{g,t} mass, but the new code uses an unweighted control .mean(), unweighted switcher counts via eligible.sum(), and carries the same equal-group logic into the Phase 2 IF/placebo/normalized/delta paths. On microdata or any panel with uneven (group, time) cell sizes, the shipped Phase 2 outputs estimate a different estimand than the cited method, and there is no explicit Phase 2 Note/Deviation label carrying forward the Phase 1 equal-cell contract. Concrete fix: either implement N_mat[:, out_idx] weights end-to-end in _compute_multi_horizon_dids(), _compute_per_group_if_multi_horizon(), _compute_multi_horizon_placebos(), _compute_normalized_effects(), and _compute_cost_benefit_delta(), or add an explicit REGISTRY Note/Deviation label that equal-cell weighting is intentional for all Phase 2 estimands and pin it with an unequal-cell regression test. Location: diff_diff/chaisemartin_dhaultfoeuille.py#L2221, diff_diff/chaisemartin_dhaultfoeuille.py#L2331, diff_diff/chaisemartin_dhaultfoeuille.py#L2426, docs/methodology/REGISTRY.md#L493, docs/methodology/REGISTRY.md#L518.

  • Severity: P1. Impact: the new sup-t band code bootstraps each horizon independently by calling _bootstrap_one_target() in a loop, and _bootstrap_one_target() generates a fresh weight matrix on every call. That destroys replicate alignment across horizons, so sup_t_dist[b] is built from unrelated draws rather than one joint bootstrap replicate of the event-study vector. The resulting cband_conf_int is therefore not a valid multiplier-bootstrap simultaneous band. Concrete fix: generate one shared (n_bootstrap, n_groups) multiplier matrix for the full event-study IF matrix and compute all horizon perturbations from that same draw, mirroring the existing shared-weight pattern already used in staggered_bootstrap.py. Location: diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L255, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py#L369, diff_diff/staggered_bootstrap.py#L350, diff_diff/staggered_bootstrap.py#L498.

  • Severity: P3. Impact: re-review note only: the dynamic-placebo inference gap is now explicitly documented as a Phase 2 limitation, so I am not counting it as a blocker under the stated review rules. Concrete fix: none required for approval in this PR unless formal placebo inference is in scope. Location: docs/methodology/REGISTRY.md#L542, diff_diff/chaisemartin_dhaultfoeuille.py#L1537.
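The equal-cell versus N_{g,t} weighting contrast in the second finding can be shown with a two-control toy (all numbers and names invented):

```python
def control_mean(changes, cell_sizes=None):
    """Sketch: mean control-group change. Unweighted (equal-cell) unless
    outcome-period cell sizes N_{g,t} are supplied, in which case each
    group's change is weighted by its cell mass."""
    if cell_sizes is None:
        return sum(changes) / len(changes)
    return sum(c * n for c, n in zip(changes, cell_sizes)) / sum(cell_sizes)

changes = [0.1, 0.4]   # outcome changes for two control groups
sizes = [90, 10]       # N_{g,t} observation counts in the outcome period

print(control_mean(changes))         # equal-cell contract: 0.25
print(control_mean(changes, sizes))  # N_{g,t}-weighted: 0.13
```

With balanced cells the two coincide, which is why only an unequal-cell regression test can pin the intended contract.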

Code Quality

  • No additional findings beyond the methodology-contract issues above.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • No existing entry under TODO.md#L51 covers the Phase 2 delta-inference, weighting, or sup-t issues above, so none of them are mitigated by tracking.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Compute analytical delta inference on the default n_bootstrap=0 path from multi_horizon_se, and add a regression test that overall_se is finite for a nondegenerate L_max > 1 fit.
  2. Decide the Phase 2 weighting contract and implement it consistently: either switch to the paper’s N_{g,t} weighting throughout the DID_l / placebo / normalized / delta / IF pipeline, or add an explicit REGISTRY Note/Deviation label that equal-cell weighting is intentionally carried forward to all Phase 2 estimands and add an unequal-cell regression test.
  3. Rework sup-t calibration to use one shared multiplier draw across all horizons in each bootstrap replicate, then add a regression test that the dCDH event-study bootstrap uses the same shared-weight pattern as staggered_bootstrap.py.
  4. Extend parity and unit coverage to assert horizon-level SE/CI, placebo outputs, and the overall_* contract for L_max > 1.
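The shared-draw requirement in item 3 can be sketched without the library: each bootstrap replicate draws one multiplier weight vector and applies it to the influence functions of every horizon, so the max-|t| statistic is taken over a joint draw. Everything below is a toy layout; the actual helpers (_bootstrap_one_target() and friends) are not reproduced.

```python
import random

random.seed(0)
n_groups, n_horizons, n_boot = 50, 3, 500

# Toy recentered per-group influence functions, one row per horizon.
IF = [[random.gauss(0, 1) for _ in range(n_groups)]
      for _ in range(n_horizons)]
# Analytical SE implied by each horizon's IF.
se = [(sum(x * x for x in row)) ** 0.5 / n_groups for row in IF]

sup_t = []
for _ in range(n_boot):
    w = [random.gauss(0, 1) for _ in range(n_groups)]  # ONE shared draw
    t_stats = []
    for l in range(n_horizons):
        pert = sum(wi * xi for wi, xi in zip(w, IF[l])) / n_groups
        t_stats.append(abs(pert) / se[l])
    sup_t.append(max(t_stats))  # max over horizons, same replicate

sup_t.sort()
crit = sup_t[int(0.95 * n_boot) - 1]  # ~95th-percentile sup-t value
print(crit)
```

Drawing fresh weights per horizon, as the finding describes, breaks the cross-horizon alignment inside max(), which is exactly what invalidates the band.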

…l Note

- Compute delta-method SE regardless of bootstrap (was gated on
  bootstrap_results != None, leaving analytical path with NaN)
- Generate one shared bootstrap weight matrix for all horizons so
  sup-t bands are a valid joint multiplier-bootstrap band
- Add REGISTRY Note for Phase 2 equal-cell weighting deviation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 41bc1e8c3ca5a03b8bc1965782a782d1d17b72bf


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review status: the earlier P1s on equal-cell weighting, default analytical delta inference, and sup-t shared bootstrap draws look resolved or explicitly documented in the current diff.
  • P1 [Newly identified]: the new L_max path does not preserve the library’s documented terminal-missingness contract. A switcher/placebo can remain in N_l / N_pl_l after all observed baseline-matched controls disappear at that horizon, and the code zero-retains the group instead of excluding it.
  • That denominator issue affects DID_l, DID^{pl}_l, and downstream DID^n_l, delta, and per-horizon SEs because they reuse the same Phase 2 eligibility/count logic.
  • P2: the new R fixtures already contain horizon-level SE/CI and placebo outputs, but the parity suite only asserts point estimates and there is no Phase 2 ragged-panel regression test.
  • P3: public docs/docstrings still mix shipped Phase 2 behavior with stale “Phase 2 will add” / placeholder wording and outdated top-level bootstrap-inference semantics once overall_att becomes delta.
  • Static review only: I could not run the Python suite in this read-only sandbox with missing runtime deps.

Methodology

Affected methods: multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized DID^n_l, and cost-benefit delta. These are all part of the dynamic companion paper’s event-study/cost-benefit construction. (nber.org)

Code Quality

  • No additional findings beyond the methodology issue above.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • No additional findings. The active P1 above is not covered by TODO.md:L51-L77, so it remains unmitigated under the stated review rules.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Fix Phase 2 horizon eligibility so missingness-driven empty control pools are excluded from N_l / N_pl_l instead of zero-retained, and apply the same rule in _compute_per_group_if_multi_horizon().
  2. Add a regression test for L_max >= 2 on a terminally missing panel that asserts the affected switcher/placebo is excluded rather than zero-retained, and extend parity coverage to horizon-level SE/CI/placebo outputs.
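The exclusion rule in item 1 amounts to shrinking the denominator N_l instead of averaging in a zero. A minimal sketch, with invented names and data:

```python
def horizon_average(did_by_group):
    """Sketch: DID_l as the mean over groups with a non-empty control
    pool; None marks an empty pool and is dropped from N_l entirely."""
    kept = [v for v in did_by_group.values() if v is not None]
    if not kept:
        return float("nan"), 0
    return sum(kept) / len(kept), len(kept)  # (DID_l, N_l)

# Group "c" lost all baseline-matched controls at this horizon.
effects = {"a": 1.0, "b": 0.6, "c": None}

did_l, n_l = horizon_average(effects)  # exclusion path
zero_retained = sum(v if v is not None else 0.0
                    for v in effects.values()) / len(effects)
print(did_l, n_l)       # 0.8 over N_l = 2
print(zero_retained)    # zero-retention shrinks the estimate toward zero
```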

…taining

When a switcher's control pool is empty at a given horizon (e.g.,
due to terminal missingness), exclude the group from N_l / N_pl_l
rather than zero-retaining it. Zero-retention biases DID_l toward
zero on ragged panels. Reserve zero-contribution for the IF path
only (where it naturally has no effect). Same fix applied to
_compute_multi_horizon_placebos.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a7f32990c6b2b9f8d5e3e1e630e0e08866775323


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The earlier ragged-panel / empty-control-pool finding is only partially fixed: DID_l now excludes those (g,l) pairs, but the new IF/SE/bootstrap path still treats them as zero contributions, so inference remains inconsistent on terminal-missing panels.
  • The Phase 2 surface now explicitly allows event_study_effects[1] to differ from Phase 1 DID_M when L_max >= 2, while the registry and README still state DID_M = DID_1.
  • When L_max > 1 and n_bootstrap > 0, top-level overall_* inference for the new default overall estimand delta is recomputed with safe_inference(), even though the registry/results docs and summary still promise percentile-bootstrap top-level inference.
  • The new R fixtures already contain SE/CI/placebo outputs, but the Phase 2 parity tests still pin only point estimates.
  • Static review only; I did not execute the suite in this read-only sandbox.

Methodology

Affected methods: multi-horizon DID_l, the per-group DID_{g,l} IF/SE path, DID_1, and top-level delta inference.

  • Severity: P1. Impact: _compute_multi_horizon_dids() now excludes switchers whose observed baseline-matched control pool is empty, but _compute_per_group_if_multi_horizon() still leaves those same cases in the IF as zero contributions, and both the analytical SE path and mh_boot_inputs recenter over that larger zero-retained sample while dividing by the reduced N_l. On terminal-missing panels, the point estimate and SE/CI/sup-t/delta SE are therefore computed from different eligibility rules. Concrete fix: make the IF/bootstrap path consume the finalized Phase 2 eligible_mask (or recompute the same empty-control exclusion there), and add a regression with a terminally missing control that asserts both DID_l and its inference exclude the same switcher. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L2242, diff_diff/chaisemartin_dhaultfoeuille.py:L2374, diff_diff/chaisemartin_dhaultfoeuille.py:L1096, diff_diff/chaisemartin_dhaultfoeuille.py:L1358.
  • Severity: P1. Impact: the Phase 2 API intentionally switches event_study_effects[1] to a different computation path whenever L_max >= 2, and the new test suite explicitly accepts that it “may differ” from Phase 1 DID_M. That conflicts with the documented/source-material identity DID_M = DID_1 and makes the l=1 estimate depend on whether the user asks for additional horizons. Concrete fix: either make event_study_effects[1] reuse the Phase 1 DID_M estimand when L_max is set, or add an explicit REGISTRY.md Note/Deviation and distinct user-facing labeling if the project intends Phase 2 DID_1 to be a different estimand. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1439, tests/test_chaisemartin_dhaultfoeuille.py:L1839, docs/methodology/REGISTRY.md:L466, README.md:L1160.
  • Severity: P1. Impact: after bootstrap runs, Phase 2 overwrites the top-level overall estimand with delta and recomputes overall_p_value / overall_conf_int via safe_inference(delta, delta_se). The registry, results docstring, and summary still tell users that top-level overall_* fields are percentile-bootstrap when n_bootstrap > 0, so the new default overall estimand silently breaks the documented inference contract. Concrete fix: use the shared horizon bootstrap draws to form delta_b = Σ_l w_l DID_{l,b} and populate top-level overall_* from _compute_effect_bootstrap_stats; if the project intentionally wants normal-theory delta inference, document that exception everywhere and stop printing the percentile-bootstrap note for delta. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1496, docs/methodology/REGISTRY.md:L542, docs/methodology/REGISTRY.md:L590, diff_diff/chaisemartin_dhaultfoeuille_results.py:L150, diff_diff/chaisemartin_dhaultfoeuille_results.py:L543.

Code Quality

  • No additional findings beyond the methodology issues above.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • No separate findings. I did not find matching TODO.md entries that would mitigate the active P1 issues above.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new golden fixtures already include horizon-level SE/CI/placebo outputs, but the Phase 2 parity suite still asserts only point estimates. That leaves the new inference surfaces unpinned and would not catch the delta bootstrap-contract regression above. Concrete fix: extend _check_multi_horizon() to assert overall_se, CI bounds, n_switchers, and placebo entries from the golden file, and add a dedicated ragged-panel L_max >= 2 regression for empty observed control pools. Location: tests/test_chaisemartin_dhaultfoeuille_parity.py:L239, benchmarks/R/generate_dcdh_dynr_test_values.R:L294.
  • Severity: P3. Impact: some user-facing docs still describe shipped Phase 2 behavior as future work and still say placebo bootstrap “will add” in Phase 2, which no longer matches the code/registry. Concrete fix: reconcile the README and dCDH results docstrings with the actual L_max/delta/dynamic-placebo surface. Location: README.md:L1160, README.md:L1255.

Path to Approval

  1. Make the Phase 2 IF/bootstrap path use the same post-exclusion horizon eligibility that DID_l uses, then add a terminal-missingness regression that checks both the estimate and its inference.
  2. Resolve the DID_M versus DID_1 contract: either keep l=1 invariant to L_max or explicitly document and relabel the alternate Phase 2 l=1 estimand in REGISTRY.md and user-facing docs.
  3. Generate a bootstrap distribution for delta from the shared horizon draws and populate top-level overall_* from it; otherwise, document the analytical fallback everywhere and stop advertising percentile-bootstrap top-level inference for L_max > 1.

…delta contracts

- IF/SE/bootstrap paths now use the same combined eligibility mask as
  _compute_multi_horizon_dids (singleton-baseline + empty-control-pool
  exclusions), so point estimate and inference agree on terminal-missing
  panels
- Add REGISTRY Note documenting that event_study_effects[1] uses per-group
  DID_{g,1} (cohort-based controls) when L_max >= 2, which may differ
  from Phase 1 DID_M (period-based controls) on mixed-direction panels
- Add REGISTRY Note documenting that delta SE uses delta-method
  (normal-theory) even when bootstrap is enabled, as an intentional
  exception to the bootstrap-inference-surface contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4d71c59d7320d3a21343cb72aa6455a0fbe1fa98


Overall Assessment

⛔ Blocker

Executive Summary

  • Affected methods: multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized DID^n_l, cost-benefit delta, and the new multiplier-bootstrap / sup-t inference surface.
  • The prior DID_1 vs DID_M identity issue and the top-level delta inference-contract issue are now explicitly documented in docs/methodology/REGISTRY.md, so they are no longer blocking defects under the registry rules.
  • The new Phase 2 variance path still drops control-role groups out of the horizon IF before recentering and bootstrapping, so the new SEs/CIs/p-values/sup-t bands are not based on the full group-level IF.
  • The new delta inference path silently skips weighted horizons with NaN SE instead of propagating NaN to all top-level inference fields, which can yield misleading finite p-values/CIs.
  • The new golden fixtures include horizon SE/CI/placebo outputs, but the parity suite still only pins point estimates, so it would not catch the new inference bugs.
  • Static review only; I did not execute the suite here because the sandbox Python environment is missing numpy.

Methodology

  • Severity: P0. Impact: overall_att becomes the cost-benefit delta, but its SE path silently drops any positively weighted horizon whose SE is non-finite (if np.isfinite(se_l): var_delta += ...). That means overall_p_value / overall_conf_int can remain finite even when part of the aggregate has undefined inference, violating the project’s NaN-consistent inference contract for statistical output. Concrete fix: before computing delta_se, require every horizon with w_l > 0 to have finite SE; otherwise set overall_se, overall_t_stat, overall_p_value, and overall_conf_int to NaN. If you want non-NaN inference here, bootstrap delta directly from aligned horizon draws instead of dropping horizons term-by-term. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1510-L1539, docs/methodology/REGISTRY.md:L576-L580, diff_diff/utils.py:L152-L186.

  • Severity: P1. Impact: _compute_per_group_if_multi_horizon() explicitly builds a full per-group IF with both switcher and control-role contributions, but the analytical SE path and the bootstrap path immediately intersect that IF with did_eligible, i.e. the switcher-only mask. That drops never-switchers and later-switching controls even when they carry non-zero IF mass, so the new analytical SEs, bootstrap SEs/CIs/p-values, sup-t bands, normalized-effect SEs, and delta SE are all based on an incomplete IF and will be understated. This is not aligned with the registry’s group-level IF contract, which says stable controls can have non-zero U^G_g. Concrete fix: keep the Phase 2 U_l vectors at full variance-eligible group length (apply only the singleton-baseline variance filter), recenter on that full group set, and feed those full-length vectors into the shared-weight bootstrap so group identities stay aligned across horizons. Location: diff_diff/chaisemartin_dhaultfoeuille.py:L1096-L1105, diff_diff/chaisemartin_dhaultfoeuille.py:L1364-L1393, diff_diff/chaisemartin_dhaultfoeuille.py:L2300-L2398, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L254-L309, docs/methodology/REGISTRY.md:L564-L570, docs/methodology/REGISTRY.md:L590-L592.
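The proposed fix — recenter over the full variance-eligible group set rather than a switcher-only mask — can be sketched as follows. Names and the exact variance normalization are assumptions for illustration; the real path feeds these vectors into the shared-weight bootstrap as well.

```python
import numpy as np

def horizon_analytical_se(U_l, variance_eligible):
    """Illustrative sketch: recenter the per-group influence function
    over the FULL variance-eligible group set -- switchers AND
    control-role groups -- instead of intersecting with a
    switcher-only mask that zeroes out control contributions."""
    U = np.asarray(U_l, dtype=float)[np.asarray(variance_eligible, dtype=bool)]
    G = U.size
    U_centered = U - U.mean()  # recenter on the full group set
    # variance of the mean of G group-level IF terms
    return float(np.sqrt(np.sum(U_centered ** 2) / (G * (G - 1))))
```

Under this sketch, a control-only group carrying non-zero IF mass changes the horizon SE — which is exactly the regression the Path to Approval below asks for.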

  • Severity: P3. Impact: The prior re-review blockers around event_study_effects[1] versus Phase 1 DID_M, and around normal-theory delta inference under n_bootstrap > 0, are now explicitly documented in the registry with Note labels, so they no longer count as methodology defects under the stated review policy. Concrete fix: None required for approval; optional user-doc sync remains below. Location: docs/methodology/REGISTRY.md:L536-L546.

Code Quality

  • No additional findings beyond the methodology issues above.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • Severity: P3. Impact: TODO.md does not currently track the new Phase 2 parity/docs follow-up work, so if that cleanup is deferred it is not mitigated by the project’s deferred-work policy. The active P0/P1 inference bugs are not TODO-mitigable in any case. Concrete fix: if you defer the P2 test/doc cleanup, add an entry under TODO.md → “Tech Debt from Code Reviews.” Location: TODO.md:L51-L77.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new golden-value generator now records per-horizon SE, CI, placebo, and switcher-count outputs, but the multi-horizon parity suite still asserts only point estimates, and the sup-t tests only check that simultaneous bands are wider than pointwise CIs on a synthetic panel. That would not catch either inference bug above. Concrete fix: extend _check_multi_horizon() to assert horizon overall_se, CI bounds, n_switchers, and placebo outputs from the golden file, and add one regression where a control-only group has non-zero U_l outside the switcher set plus one regression where a positive-weight horizon has NaN SE and delta inference must be all-NaN. Location: benchmarks/R/generate_dcdh_dynr_test_values.R:L294-L325, tests/test_chaisemartin_dhaultfoeuille_parity.py:L239-L258, tests/test_chaisemartin_dhaultfoeuille.py:L2069-L2101.
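An extended parity check along the lines suggested could look like the sketch below. The record and fixture field names here are assumptions mirroring the finding's wording, not the actual shape of `_check_multi_horizon()` or the golden file.

```python
import numpy as np

def check_multi_horizon(result, golden, atol=1e-6):
    """Hypothetical extension of _check_multi_horizon(): pin the
    inference surface (SE, CI bounds, switcher counts, placebos) from
    the golden fixture, not just point estimates."""
    for l, g in golden.items():
        eff = result[l]  # per-horizon effect record (illustrative shape)
        assert np.isclose(eff["att"], g["att"], atol=atol)
        assert np.isclose(eff["se"], g["se"], atol=atol)
        assert np.isclose(eff["ci_lo"], g["ci_lo"], atol=atol)
        assert np.isclose(eff["ci_hi"], g["ci_hi"], atol=atol)
        assert eff["n_switchers"] == g["n_switchers"]
        if "placebo_att" in g:
            assert np.isclose(eff["placebo_att"], g["placebo_att"], atol=atol)
```

The point is that a regression in any inference field now trips the parity guard, not only a drift in the point estimate.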

  • Severity: P3. Impact: several user-facing docs still describe Phase 2 behavior as future work or say top-level overall_* inference is always percentile-bootstrap when n_bootstrap > 0, which is no longer true once overall_att is repurposed to delta; the placebo notes also still say Phase 2 will add bootstrap placebo inference even though the registry now documents that as deferred follow-up. Concrete fix: align README / API docs / results docstring / summary note with the registry’s Phase 2 notes. Location: README.md:L1160-L1160, README.md:L1208-L1255, diff_diff/chaisemartin_dhaultfoeuille_results.py:L150-L166, diff_diff/chaisemartin_dhaultfoeuille_results.py:L543-L548, docs/api/chaisemartin_dhaultfoeuille.rst:L199-L214, docs/choosing_estimator.rst:L289-L295, docs/llms-full.txt:L321-L323.

Path to Approval

  1. Rebuild the Phase 2 analytical SE and bootstrap inputs on the full group-level horizon IF, not the switcher-only did_eligible subset. A minimal regression should assert that a control-only group can carry non-zero U_l and changes the horizon SE.
  2. Make delta inference NaN-consistent: if any positively weighted horizon has non-finite SE, set all top-level delta inference fields to NaN, or bootstrap delta directly from aligned horizon draws. Add a regression with one weighted horizon degenerate and assert overall_se, overall_t_stat, overall_p_value, and overall_conf_int all become NaN.
  3. Expand the multi-horizon parity coverage to pin SE/CI/placebo/switcher-count outputs from the existing golden fixtures so the inference surface is actually protected in CI.

- Delta SE now requires ALL positively-weighted horizons to have finite
  SE; if any has NaN, overall_se/p/CI are all NaN (NaN-consistent
  inference contract)
- IF/SE/bootstrap paths use full variance-eligible group set (singleton-
  baseline filter only), not the switcher-only did_eligible mask. Never-
  switchers and later-switching controls with non-zero IF mass from
  their control roles are now included, matching the Phase 1 IF contract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 8cac82d46f72e85861d031788783fb2b538ad382


Overall Assessment

✅ Looks good

Executive Summary

  • Affected methods: multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized DID^n_l, cost-benefit delta, and the multi-horizon variance/bootstrap surface from the dynamic companion paper. (nber.org)
  • The prior blocker-level issues appear resolved: the multi-horizon analytical SE/bootstrap now keep the full variance-eligible group-level IF, and delta inference now propagates NaN consistently when a positively weighted horizon lacks a finite SE.
  • The remaining substantive gap is non-blocking: the new multi-horizon parity fixtures include SE/CI/placebo/switcher-count outputs, but the parity test still asserts only point estimates.
  • One user-facing warning is misleading: it says groups for affected horizons are zeroed but retained in N_l / N^{pl}_l, while the helper actually drops those groups from the denominator.
  • Static diff review only; I did not execute the test suite here.

Methodology

No unmitigated P0/P1 findings. The previously reported IF-scope and delta NaN-consistency blockers appear fixed in diff_diff/chaisemartin_dhaultfoeuille.py:L1096-L1105, diff_diff/chaisemartin_dhaultfoeuille.py:L1364-L1392, diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L254-L309, and diff_diff/chaisemartin_dhaultfoeuille.py:L1521-L1548. The remaining deviations I reviewed are explicitly documented in docs/methodology/REGISTRY.md:L520-L546, so they are non-blocking under the stated policy. (nber.org)

Code Quality

No findings.
Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity: P3
    Impact: The remaining test/doc cleanup below is deferrable, but it is not currently tracked under TODO.md's "Tech Debt from Code Reviews" section.
    Concrete fix: If the parity/docs cleanup is deferred, add one entry under TODO.md:L51-L97 referencing this PR and the specific deferred work.

Security

No findings.

Documentation/Tests

  • Severity: P2
    Impact: The new R fixture generator now records per-horizon SEs, CI bounds, placebo outputs, and switcher counts, but the new multi-horizon parity test still asserts only point estimates. A regression in horizon inference, placebo outputs, or n_switchers would bypass the main parity guard.
    Concrete fix: Extend _check_multi_horizon() to assert overall_se, overall_ci_lo, overall_ci_hi, n_switchers, and placebo outputs from the existing golden file, and add a dedicated regression where a positively weighted horizon has NaN SE so overall_se, overall_t_stat, overall_p_value, and overall_conf_int must all be NaN. Refs: benchmarks/R/generate_dcdh_dynr_test_values.R:L294-L325, tests/test_chaisemartin_dhaultfoeuille_parity.py:L221-L259.

  • Severity: P3
    Impact: README still mixes shipped-Phase-2 wording with older "Phase 2 will add DID_l" text and still describes overall_att as always DID_M, which conflicts with the new L_max > 1 behavior where overall_att becomes delta.
    Concrete fix: Align the README overview/table with the current registry and results surface. Refs: README.md:L1160-L1160, README.md:L1208-L1210, README.md:L1216-L1239, docs/methodology/REGISTRY.md:L520-L546.

- Fix multi-horizon control-availability warnings to say "excluded
  from N_l" (not "zeroed and retained")
- Update README overall_att description to reflect delta when L_max > 1
- Update README placebo Note to reflect current state (not "Phase 2
  will add")
- Update ROADMAP 2c and 2h status to reflect placebo SE and parity
  SE/placebo deferrals

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 39f3978296dbe18608febbd28c52a2a4b4c66afa


Overall Assessment

⛔ Blocker

The re-review found one unmitigated P0 in the new Phase 2 dynamic placebo path and one unmitigated P1 in the new Phase 2 top-level delta surface. Static diff review only; I did not execute the test suite in this environment.

Executive Summary

  • Affected methods reviewed: multi-horizon DID_l, dynamic placebos DID^{pl}_l, normalized DID^n_l, cost-benefit delta, and the Phase 2 inference/result-propagation surface.
  • P0: the new placebo helper can include terminally missing controls that are not observed at the forward eligibility horizon, which changes control-group composition for DID^{pl}_l with no warning.
  • P1: when Phase 2 delta is non-estimable, the top-level overall_* fields can silently remain the old Phase 1 DID_M values even though the results surface labels them as delta.
  • The prior warning-text inconsistency around exclusion from N_l / N^{pl}_l appears fixed.
  • The prior non-blocking parity gap remains: the new golden fixtures include SE/CI/placebo/n_switchers, but the multi-horizon parity test still checks only point estimates.

Methodology

Affected methods: DID_l, DID^{pl}_l, DID^n_l, and delta.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity: P3. Impact: the remaining non-blocking parity/docs follow-ups are not tracked under TODO.md's “Tech Debt from Code Reviews” section, so if they are intentionally deferred they have no project-level breadcrumb. Concrete fix: if those items are not fixed in this PR, add a TODO.md entry referencing this PR and the specific deferred work. Refs: TODO.md:L51-L98

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new R fixture generator now stores per-horizon SEs, CI bounds, placebo outputs, and n_switchers, but _check_multi_horizon() still asserts only point estimates. A regression in Phase 2 inference, placebo outputs, or switcher counts would bypass the main parity guard. Concrete fix: extend _check_multi_horizon() to assert at least per-horizon SE, CI bounds, n_switchers, and placebo point estimates from the existing golden file. Refs: benchmarks/R/generate_dcdh_dynr_test_values.R:L294-L325, tests/test_chaisemartin_dhaultfoeuille_parity.py:L239-L258

  • Severity: P3. Impact: several changed docs still contain stale or internally inconsistent Phase 2 language. Examples: README still says “Phase 2 will add” multi-horizon output, the registry checklist still says L_max is behind a NotImplementedError gate, and other docs still say placebo bootstrap will arrive “in Phase 2” even though the new registry note says placebo inference remains deferred. Concrete fix: align README/REGISTRY/chooser docs with the shipped Phase 2 contract and the still-deferred placebo inference work. Refs: README.md:L1160, docs/choosing_estimator.rst:L289-L295, docs/methodology/REGISTRY.md:L586-L586, docs/methodology/REGISTRY.md:L614-L617

Path to Approval

  1. In _compute_multi_horizon_placebos(), exclude controls that are not observed at forward_idx, and make sure T_g cannot extend placebo support via terminally missing groups.
  2. Add a regression test where a terminally missing baseline-matched control would currently enter the placebo pool, and assert it is excluded.
  3. When L_max >= 2 and cost_benefit_delta["delta"] is non-finite, set all top-level overall_* fields to NaN with a warning instead of leaving the Phase 1 DID_M values in place.
  4. Add a regression test for a Phase 2 run with non-estimable delta and assert overall_att, overall_se, overall_t_stat, overall_p_value, and overall_conf_int are NaN-consistent.
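Item 1 above — the dual eligibility condition for placebo controls — can be sketched as a boolean mask. This is illustrative only: `N_mat` is assumed to hold group-by-period cell counts, and the argument names are not the library's actual signature.

```python
import numpy as np

def placebo_control_mask(N_mat, is_switcher_l, base_idx, forward_idx):
    """Hypothetical sketch of the dual eligibility condition for
    dynamic-placebo controls: a control-role group must be observed at
    the baseline period AND at the forward horizon, so a terminally
    missing group cannot leak into DID^{pl}_l."""
    observed_base = N_mat[:, base_idx] > 0
    observed_forward = N_mat[:, forward_idx] > 0  # the added condition
    return (~np.asarray(is_switcher_l, dtype=bool)) & observed_base & observed_forward
```

A terminally missing group (observed at baseline, unobserved at the forward horizon) is excluded under this mask, which is the regression asked for in item 2.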

…lback

- Add N_mat[ctrl, forward_idx] > 0 to placebo control mask so
  terminally missing controls don't leak into DID^{pl}_l
- When delta is NaN (non-estimable), set all overall_* to NaN
  instead of silently falling back to the Phase 1 DID_M values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 12, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a3b01bc706eaead4d89640cd8b06523341a9929d


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • Severity: P2. Impact: the Phase 2 summary note is now semantically wrong for overall_* when L_max >= 2 and bootstrap is enabled. The code prints “multiplier-bootstrap percentile inference” whenever bootstrap_results exists and overall_se is finite, but the new Phase 2 contract explicitly computes delta inference from safe_inference(delta, delta_se) instead. That can mislead users about what interval/p-value they are reading. Concrete fix: branch the summary note on self.L_max >= 2 and self.cost_benefit_delta is not None, and describe the delta path as delta-method/normal-theory even when per-horizon SEs came from bootstrap (diff_diff/chaisemartin_dhaultfoeuille_results.py:L543-L552, docs/methodology/REGISTRY.md:L544-L546, docs/methodology/REGISTRY.md:L592-L592).
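The proposed branching of the summary note can be sketched as a small helper. Attribute and argument names are assumptions for illustration; the real code lives on the results object.

```python
import numpy as np

def overall_inference_note(L_max, cost_benefit_delta, bootstrap_results, overall_se):
    """Hypothetical sketch of the summary() branching: when L_max >= 2
    the overall_* fields describe delta, whose inference is
    delta-method/normal-theory even if the per-horizon SEs came from
    the multiplier bootstrap."""
    if L_max is not None and L_max >= 2 and cost_benefit_delta is not None:
        return ("Overall (delta): delta-method normal-theory SE/CI "
                "(per-horizon SEs may be bootstrap-based).")
    if bootstrap_results is not None and np.isfinite(overall_se):
        return "Overall: multiplier-bootstrap percentile inference."
    return "Overall: analytical (normal-theory) inference."
```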

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: the remaining deferred parity/docs follow-ups are still not tracked in TODO.md’s “Tech Debt from Code Reviews” section, so there is no project breadcrumb if they stay deferred. Concrete fix: add a TODO.md entry tying this PR to the still-deferred parity expansion and doc cleanup (TODO.md:L51-L97).

Security

  • No findings.

Documentation/Tests

- summary() now describes delta SE as delta-method (normal-theory)
  when L_max >= 2 with bootstrap, instead of claiming percentile
- Update README, choosing_estimator.rst, REGISTRY.md to reflect
  shipped Phase 2 state (was: "Phase 2 will add")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 12, 2026
@igerber igerber merged commit 6e43050 into main Apr 12, 2026
24 of 26 checks passed
@igerber igerber deleted the dcdh-phase2 branch April 12, 2026 20:53