fix: dynamic fill_zero in get_binned_data to prevent KL/PSI score distortion#1878
fix: dynamic fill_zero in get_binned_data to prevent KL/PSI score distortion#1878SuryaSunil1326 wants to merge 1 commit into
Conversation
The hardcoded fill value (0.0001 or min/1e6) for zero-probability bins in get_binned_data caused two problems: 1. When min non-zero percent <= 0.0001, filling with min/1e6 produced an astronomically small value that inflated KL divergence scores. 2. Reference and current distributions received different fill values, introducing asymmetry in divergence metrics. Fix: compute a single fill value as min_nonzero_both / 10, where min_nonzero_both is the smallest non-zero probability across both distributions. This guarantees the fill is always smaller than any genuine probability and is applied symmetrically. Fixes evidentlyai#334
|
Hit the same asymmetric-fill bug shipping a Rust PSI yesterday. The |
|
Interesting — hadn't thought of the count-level approach. You're right that it's cleaner in theory: fill is purely 1/N regardless of what the distribution looks like. Ours ties fill to min_nonzero, so if the rarest observed bin has count k rather than 1, fill becomes k/10N rather than 1/10N — distribution-shape-dependent. Main reason I went post-normalization: Worth adding a |
Summary
Fixes #334.
The hardcoded
fill_zerovalue inget_binned_datacaused incorrect KL divergence (and PSI) scores in two ways:min / 1e6produced an astronomically small value (e.g.1e-11), causinglog(p/q)to blow up and inflate the drift score significantly.Fix: compute a single fill value as
min_nonzero_both / 10, wheremin_nonzero_bothis the smallest genuine non-zero probability across both distributions. This guarantees:1e-4The same fix is applied to the Spark implementation in
legacy/spark/calculations/stattests/utils.py.Reproducer (from issue)
Test plan
test_get_binned_data_fill_zero_is_dynamic— verifies fill is strictly smaller than any real probability and KL of identical distributions is ~0test_get_binned_data_fill_zero_symmetric— verifies reference and current receive the same fill value