
fix: dynamic fill_zero in get_binned_data to prevent KL/PSI score distortion #1878

Open
SuryaSunil1326 wants to merge 1 commit into
evidentlyai:main from
SuryaSunil1326:fix/dynamic-fill-zero-binned-data

Conversation

@SuryaSunil1326

Summary

Fixes #334.

The hardcoded fill_zero value in get_binned_data caused incorrect KL divergence (and PSI) scores in two ways (both illustrated numerically after this list):

  1. Disproportionate fill when min percent ≤ 0.0001 — filling with min / 1e6 produced an astronomically small value (e.g. 1e-11), causing log(p/q) to blow up and inflate the drift score significantly.
  2. Asymmetric fill — reference and current distributions received independently computed fill values, introducing an artificial bias into symmetric divergence metrics.
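A toy numpy illustration of both effects (the percents and the ~6x gap are illustrative only; the library's real binning produces the reproducer numbers further below):

import numpy as np

ref = np.array([0.99999, 1e-5, 0.0])   # rare bin b; bin c never seen
cur = np.array([0.99999, 0.0, 1e-5])   # bin b never seen; rare bin c

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Old rule: min non-zero is 1e-5 <= 0.0001, so fill = 1e-5 / 1e6 = 1e-11
# and the zero-filled bin contributes log(1e-5 / 1e-11) = log(1e6) ~ 13.8.
old = kl(np.where(ref == 0, 1e-11, ref), np.where(cur == 0, 1e-11, cur))

# New rule: shared fill = min_nonzero_both / 10 = 1e-6, so the worst
# log ratio is log(1e-5 / 1e-6) = log(10) ~ 2.3.
new = kl(np.where(ref == 0, 1e-6, ref), np.where(cur == 0, 1e-6, cur))
print(old, new)   # ~1.4e-4 vs ~2.1e-5 on the same data, ~6x apart

# The old fill was also computed per distribution, so when the two minima
# differ (say 1e-5 vs 0.3) the fills differ too (1e-11 vs 0.0001), which
# is the asymmetry in point 2.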

Fix: compute a single fill value as min_nonzero_both / 10, where min_nonzero_both is the smallest genuine non-zero probability across both distributions (see the sketch after this list). This guarantees:

  • The fill is always strictly smaller than any real probability (no inflation)
  • Both distributions use the same fill (no asymmetry)
  • Edge case (all zeros) falls back to 1e-4
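
A minimal sketch of the new logic (helper name and standalone shape are illustrative; the actual change lives inside get_binned_data):

import numpy as np

def apply_dynamic_fill(reference_percents, current_percents):
    # Smallest genuine non-zero probability across BOTH distributions.
    nonzero = np.concatenate([
        reference_percents[reference_percents > 0],
        current_percents[current_percents > 0],
    ])
    # One order of magnitude below any real probability; fall back to
    # 1e-4 only if every bin in both distributions is zero.
    fill_zero = nonzero.min() / 10 if nonzero.size else 1e-4
    reference_percents = np.where(reference_percents == 0, fill_zero, reference_percents)
    current_percents = np.where(current_percents == 0, fill_zero, current_percents)
    return reference_percents, current_percents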

The same fix is applied to the Spark implementation in legacy/spark/calculations/stattests/utils.py.
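
In PySpark terms the same computation looks roughly like this (ref_df, cur_df, and the percent column are assumed shapes for illustration, not the module's actual names):

from pyspark.sql import functions as F

# Smallest non-zero percent across both binned DataFrames.
min_nonzero = (
    ref_df.select("percent").union(cur_df.select("percent"))
    .filter(F.col("percent") > 0)
    .agg(F.min("percent"))
    .first()[0]
)
fill_zero = min_nonzero / 10 if min_nonzero is not None else 1e-4

fill = F.when(F.col("percent") == 0, fill_zero).otherwise(F.col("percent"))
ref_df = ref_df.withColumn("percent", fill)
cur_df = cur_df.withColumn("percent", fill)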

Reproducer (from issue)

# Before fix: KL score ≈ 1.9 (inflated)
# After fix:  KL score ≈ 0.41 (correct)
import pandas as pd

a, b, c = 0, 1, 2   # placeholder values for the issue's a, b, c
reference = pd.Series([a] * 99_999 + [b])   # p(b) ≈ 1e-5
current   = pd.Series([a] * 99_999 + [c])   # c unseen in reference

Test plan

  • test_get_binned_data_fill_zero_is_dynamic — verifies fill is strictly smaller than any real probability and KL of identical distributions is ~0
  • test_get_binned_data_fill_zero_symmetric — verifies reference and current receive the same fill value (property sketched below)
  • Full stattest suite passes (103 tests)
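
In property form, the symmetry check reduces to roughly this (written against the apply_dynamic_fill sketch above rather than the real get_binned_data call):

import numpy as np

def test_fill_zero_symmetric_property():
    ref = np.array([0.7, 0.3, 0.0])
    cur = np.array([0.98, 0.0, 0.02])
    ref_filled, cur_filled = apply_dynamic_fill(ref, cur)
    # Both zero bins receive the identical shared fill value...
    assert ref_filled[2] == cur_filled[1] == 0.02 / 10
    # ...and that fill sits strictly below every genuine probability.
    assert ref_filled[2] < min(0.3, 0.02)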

@MukundaKatta

Hit the same asymmetric-fill bug shipping a Rust PSI yesterday. The
min_nonzero_both/10 scaling is nicer than my flat eps default. One question
on the design: I add eps at the count level before normalizing so the fill
stays scale-invariant for small vs huge total counts. Any reason here to
prefer the post-normalization probability fill instead?

@SuryaSunil1326
Author

Interesting — hadn't thought of the count-level approach. You're right that it's cleaner in theory: fill is purely 1/N regardless of what the distribution looks like. Ours ties fill to min_nonzero, so if the rarest observed bin has count k rather than 1, fill becomes k/10N rather than 1/10N — distribution-shape-dependent.

Main reason I went post-normalization: get_binned_data returns percents directly to kl_div/psi/jensenshannon, so adding eps at count level means re-normalizing before returning, which nudges non-zero bins too. Felt like unnecessary churn for a targeted fix. But for PSI I can see your approach being more robust — if both distributions are sparse with non-overlapping support, min_nonzero could end up larger than ideal and the fill isn't as conservative as it should be.

Worth adding a smoothing parameter for count-level Laplace as a follow-up? Could default to current behavior for backwards compat. Happy to take a stab at it if you think it's useful.
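
Roughly what I'd have in mind, with the parameter name (laplace_smoothing) and standalone shape purely hypothetical:

import numpy as np

def binned_percents(ref_counts, cur_counts, laplace_smoothing=0.0):
    # Count-level Laplace variant: add eps to every bin BEFORE
    # normalizing, so the effective fill is ~eps/N and independent of
    # the shape of the observed distribution.
    if laplace_smoothing > 0:
        ref_counts = ref_counts + laplace_smoothing
        cur_counts = cur_counts + laplace_smoothing
    ref_percents = ref_counts / ref_counts.sum()
    cur_percents = cur_counts / cur_counts.sum()
    # With laplace_smoothing == 0 zeros survive to this point and the
    # existing post-normalization fill (min_nonzero_both / 10) applies.
    return ref_percents, cur_percents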
