Fix/1616 pd na in by label count #1880
Open
ugbotueferhire wants to merge 2 commits into
UniqueValueCountCalculation._all_unique_values used pd.Series.unique(), which preserves pd.NA / np.nan / None. The seed dict then carried the missing-value sentinel as a label key, which pydantic rejected against Label = Union[StrictBool, int, str, None]. value_counts(dropna=True) already drops NA on the count side, so calling .dropna().unique() aligns the label set with the count set. Missing values continue to be reported separately by MissingValueCount. Fixes evidentlyai#1616 for DataSummaryPreset / DataDriftPreset on categorical columns containing pd.NA.
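The mismatch between the label set and the count set can be reproduced with plain pandas. The snippet below is a minimal sketch, not the code from `UniqueValueCountCalculation`; the sample series and variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative data (an assumption): a nullable string column with one missing value.
s = pd.Series(["a", "b", pd.NA, "a"], dtype="string")

# unique() keeps the missing-value sentinel, so it would become a label key.
labels_before = list(s.unique())

# Dropping missing values first yields only real labels.
labels_after = list(s.dropna().unique())

# value_counts(dropna=True) never counts the missing value.
counts = s.value_counts(dropna=True)

has_missing_before = any(pd.isna(v) for v in labels_before)
has_missing_after = any(pd.isna(v) for v in labels_after)

print(has_missing_before, has_missing_after)
print(sorted(labels_after), sorted(counts.index))
```

With the missing value dropped before `unique()`, the labels and the index of `counts` describe the same set of values.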
ByLabelCountValue lacked the pre=True key-coercion validator that its sibling ByLabelValue has, so np.int64 / pd.NA / other non-Label-typed keys reached pydantic uncoerced. Add @validator("counts", "shares", pre=True) mirroring ByLabelValue.convert_labels.

convert_types relied on np.isnan(val), which raises TypeError on pd.NA ("boolean value of NA is ambiguous"). Add an identity check for pd.NA that normalizes it to None, consistent with Label = Union[..., None]. np.nan behavior is preserved (it still falls through to pydantic's str-coercion path, matching the pre-existing test contract).

Adds end-to-end coverage matching the exact reproduction from evidentlyai#1616 on both DataSummaryPreset and DataDriftPreset, plus parametrized cases for pd.NA and np.int64 keys.

Fixes evidentlyai#1616.
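The key-coercion idea can be sketched without pydantic or evidently. `normalize_label` and `normalize_keys` below are hypothetical helpers written for illustration; they are not the PR's `convert_types` or its validator, and they only mirror the behavior described above (identity check for `pd.NA`, NumPy scalars converted to Python types, `np.nan` left untouched).

```python
import numpy as np
import pandas as pd


def normalize_label(val):
    """Hypothetical helper: coerce one dict key to a plain Python label."""
    # Identity check first: pd.NA cannot be used in a boolean context.
    if val is pd.NA:
        return None
    if isinstance(val, np.bool_):
        return bool(val)
    if isinstance(val, np.integer):
        return int(val)
    # Everything else, including np.nan, is passed through unchanged.
    return val


def normalize_keys(mapping):
    """Hypothetical helper: apply normalize_label to every key of a dict."""
    return {normalize_label(k): v for k, v in mapping.items()}


# Using np.isnan on pd.NA inside an `if` raises TypeError,
# which is why the identity check has to come before any isnan test.
try:
    if np.isnan(pd.NA):
        pass
    na_truth_test_failed = False
except TypeError:
    na_truth_test_failed = True

raw_counts = {np.int64(1): 3, pd.NA: 1, "a": 2}
clean_counts = normalize_keys(raw_counts)
print(na_truth_test_failed, clean_counts)
```

In a pydantic model, a function like `normalize_keys` would run in a `pre=True` validator so that keys are coerced before type validation sees them.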
Fixes #1616.
Description
`DataSummaryPreset` and `DataDriftPreset` crashed with a `pydantic.ValidationError` whenever a categorical column contained `pd.NA`, pandas' native missing-value sentinel for nullable dtypes (`string`, `Int64`, `boolean`, etc.). Users had to convert `pd.NA` to `None` before passing data in, defeating the point of running exploratory metrics on raw data.

This PR makes both presets handle `pd.NA` transparently. Missing values are treated as missing: they are counted by `MissingValueCount` and never surfaced as a phantom `<NA>` label in `ByLabelCountValue`. The fix is three small, layered changes (one direct cause plus two defensive guards, so the same class of bug can't quietly resurface in future `ByLabelCount` metrics). No public API changes; no `Label` type change.

The same failure hit `DataDriftPreset`. Both presets now pass.

Root cause
Three cooperating defects:

1. `UniqueValueCountCalculation._all_unique_values` used `pd.Series.unique()`, which preserves `pd.NA` / `np.nan`. The NA sentinel was then carried into the result dict as a label key.
2. `ByLabelCountValue` lacked the `@validator(..., pre=True)` key coercion that its sibling `ByLabelValue` has, so non-`Label`-typed dict keys (e.g. `np.int64`, `pd.NA`) reached pydantic uncoerced.
3. `convert_types` relied on `np.isnan(val)`, which raises `TypeError: boolean value of NA is ambiguous` on `pd.NA`.

Changes
- `src/evidently/metrics/column_statistics.py`: `UniqueValueCountCalculation._all_unique_values` now calls `.dropna().unique()`, aligning the unique-value set with the existing `value_counts(dropna=True)` on the count side. Missing values continue to be reported separately by `MissingValueCount` (no double-counting).
- `src/evidently/core/metric_types.py`:
  - Add `@validator("counts", "shares", pre=True)` on `ByLabelCountValue`, mirroring `ByLabelValue.convert_labels`.
  - Add an identity check for `pd.NA` in `convert_types` and normalize it to `None`, consistent with `Label = Union[StrictBool, int, str, None]`. `np.nan` behavior is preserved (it still falls through to pydantic's str-coercion path, matching the existing test contract).

Tests
- `tests/future/presets/test_dataset_stats_na.py`: end-to-end coverage matching the issue's exact reproduction on both `DataSummaryPreset` and `DataDriftPreset`.
- `tests/future/metrics/test_unique_value_count.py`: cases for `pd.NA` in `string` / `object` / `Int64` dtypes.
- `tests/future/test_metric_types.py`: parametrized cases covering `pd.NA` and `np.int64` keys on `ByLabelCountValue`.
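The property these tests protect can be checked with pandas alone. The sketch below is an assumption about what such a check might look like; it does not reproduce the test files listed above or call any evidently API, and `label_and_missing_counts` is a hypothetical helper.

```python
import pandas as pd


def label_and_missing_counts(series):
    """Hypothetical helper: split a column into label counts and a missing count."""
    labels = list(series.dropna().unique())
    counts = series.value_counts(dropna=True)
    missing = int(series.isna().sum())
    return labels, counts, missing


# One sample column per dtype named in the test description (sample data is assumed).
samples = {
    "string": pd.Series(["a", pd.NA, "b", "a"], dtype="string"),
    "object": pd.Series(["a", pd.NA, "b", "a"], dtype="object"),
    "Int64": pd.Series([1, pd.NA, 2, 1], dtype="Int64"),
}

results = {}
for name, series in samples.items():
    labels, counts, missing = label_and_missing_counts(series)
    results[name] = {
        # No missing-value sentinel may appear among the labels.
        "labels_clean": not any(pd.isna(v) for v in labels),
        # Label counts plus the missing count cover every row exactly once.
        "no_double_count": int(counts.sum()) + missing == len(series),
    }

print(results)
```

Each row is counted once, either under a label or as missing, which is the "no double-counting" behavior described in the changes.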