Fix/1616 pd na in by label count #1880

Open: ugbotueferhire wants to merge 2 commits into evidentlyai:main from ugbotueferhire:fix/1616-pd-na-in-by-label-count

Fixes #1616.

Description

DataSummaryPreset and DataDriftPreset crashed with a pydantic.ValidationError whenever a categorical column contained pd.NA — pandas' native missing-value sentinel for nullable dtypes (string, Int64, boolean, etc.). Users had to convert pd.NA to None before passing data in, defeating the point of running exploratory metrics on raw data.

This PR makes both presets handle pd.NA transparently. Missing values are treated as missing: they are counted by MissingValueCount and never surfaced as a phantom <NA> label in ByLabelCountValue. The fix is three small, layered changes: one direct fix plus two defensive guards, so the same class of bug can't quietly resurface in future ByLabelCount metrics. No public API changes; no Label type change.

```python
import pandas as pd
from evidently import DataDefinition, Dataset, Report
from evidently.presets import DataSummaryPreset

df = pd.DataFrame({"a": pd.Series(["x", "y", "z", pd.NA], dtype="string")})
ds = Dataset.from_pandas(df, DataDefinition(categorical_columns=["a"]))
Report([DataSummaryPreset()]).run(current_data=ds)
# Before: pydantic.ValidationError on ByLabelCountValue (counts/shares __key__)
# After:  runs cleanly; pd.NA is treated as missing, not as a label.
```

The same failure hit DataDriftPreset. Both presets now pass.

Root cause

Three cooperating defects:

  1. UniqueValueCountCalculation._all_unique_values used pd.Series.unique(), which preserves pd.NA / np.nan. The NA sentinel was then carried into the result dict as a label key.
  2. ByLabelCountValue lacked the @validator(..., pre=True) key-coercion that its sibling ByLabelValue has, so non-Label-typed dict keys (e.g. np.int64, pd.NA) reached pydantic uncoerced.
  3. convert_types relied on an np.isnan(val) check, which fails on pd.NA with TypeError ("boolean value of NA is ambiguous").
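
Defects 1 and 3 can be reproduced with pandas and NumPy alone. The snippet below is a minimal sketch, not code from this PR: `isnan_check_fails` is a hypothetical helper that mimics an `if np.isnan(val)` guard, and the real `convert_types` body is assumed rather than quoted.

```python
import numpy as np
import pandas as pd

s = pd.Series(["x", "y", "z", pd.NA], dtype="string")

# Defect 1: unique() keeps the missing-value sentinel in the label set,
# while dropna().unique() does not.
labels_with_na = list(s.unique())
labels_without_na = list(s.dropna().unique())
has_na_label = any(pd.isna(v) for v in labels_with_na)


def isnan_check_fails(val):
    """Hypothetical stand-in for a guard written as `if np.isnan(val)`."""
    try:
        if np.isnan(val):  # truth-testing the result is what fails for pd.NA
            pass
    except TypeError:
        return True
    return False


print(has_na_label, labels_without_na)
print(isnan_check_fails(pd.NA), isnan_check_fails(float("nan")))
```

The float NaN case passes through the guard without error, which is why the failure only shows up with nullable dtypes.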

Changes

  • src/evidently/metrics/column_statistics.py
    • UniqueValueCountCalculation._all_unique_values now calls .dropna().unique(), aligning the unique-value set with the existing value_counts(dropna=True) on the count side. Missing values continue to be reported separately by MissingValueCount (no double-counting).
  • src/evidently/core/metric_types.py
    • Add @validator("counts", "shares", pre=True) on ByLabelCountValue, mirroring ByLabelValue.convert_labels.
    • Identity-check pd.NA in convert_types and normalize to None, consistent with Label = Union[StrictBool, int, str, None]. np.nan behavior is preserved (still falls through to pydantic's str-coercion path, matching the existing test contract).
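
The key-coercion idea behind the two metric_types.py changes can be sketched in plain Python. This is an illustration under assumptions, not the PR's implementation: `normalize_label` and `coerce_label_keys` are hypothetical names, and the actual validator and `convert_types` code may differ.

```python
import numpy as np
import pandas as pd


def normalize_label(val):
    """Hypothetical helper: map a raw dict key to a Label-compatible value."""
    if val is pd.NA:  # identity check, so no truth-testing of NA is needed
        return None
    if isinstance(val, np.bool_):
        return bool(val)
    if isinstance(val, np.integer):
        return int(val)
    return val  # str, int, bool, None (and np.nan) pass through unchanged


def coerce_label_keys(mapping):
    """Hypothetical pre-validation step: normalize every key of a counts/shares dict."""
    return {normalize_label(key): value for key, value in mapping.items()}


raw_counts = {np.int64(1): 3, pd.NA: 1, "x": 2}
clean_counts = coerce_label_keys(raw_counts)
print(clean_counts)
```

Running the coercion before validation means the model only ever sees built-in key types, whatever dtype the source column had.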

Tests

  • New tests/future/presets/test_dataset_stats_na.py — end-to-end coverage matching the issue's exact reproduction on both DataSummaryPreset and DataDriftPreset.
  • Extended tests/future/metrics/test_unique_value_count.py — cases for pd.NA in string / object / Int64 dtypes.
  • Extended tests/future/test_metric_types.py — parametrize cases covering pd.NA and np.int64 keys on ByLabelCountValue.
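
The dtype cases listed above come down to one property: after dropping missing values, the label set matches the index of value_counts(dropna=True). The loop below is a rough sketch of that check using pandas only; `label_sets_align` is a hypothetical helper and is not taken from the PR's test files.

```python
import pandas as pd


def label_sets_align(series):
    """Hypothetical check: dropna().unique() matches value_counts(dropna=True)."""
    labels = set(series.dropna().unique())
    counted = set(series.value_counts(dropna=True).index)
    return labels == counted and not any(pd.isna(v) for v in labels)


cases = {
    "string": pd.Series(["x", "y", pd.NA], dtype="string"),
    "object": pd.Series(["x", "y", pd.NA], dtype="object"),
    "Int64": pd.Series([1, 2, pd.NA], dtype="Int64"),
}

results = {name: label_sets_align(series) for name, series in cases.items()}
print(results)
```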

UniqueValueCountCalculation._all_unique_values used pd.Series.unique()
which preserves pd.NA / np.nan / None. The seed dict then carried the
missing-value sentinel as a label key, which pydantic rejected against
Label = Union[StrictBool, int, str, None].

value_counts(dropna=True) already drops NA on the count side, so this
aligns the label set with the count set. Missing values continue to be
reported separately by MissingValueCount.

Fixes evidentlyai#1616 for DataSummaryPreset / DataDriftPreset on categorical
columns containing pd.NA.

ByLabelCountValue lacked the pre=True key-coercion validator that its
sibling ByLabelValue has, so np.int64 / pd.NA / other non-Label-typed
keys reached pydantic uncoerced. Add @validator("counts", "shares",
pre=True) mirroring ByLabelValue.convert_labels.

convert_types relied on np.isnan(val) which raises TypeError on pd.NA
("boolean value of NA is ambiguous"). Add an identity check for pd.NA
that normalizes it to None, consistent with Label = Union[..., None].
np.nan behavior is preserved (still falls through to pydantic's
str-coercion path, matching the pre-existing test contract).

Adds end-to-end coverage matching the exact reproduction from evidentlyai#1616
on both DataSummaryPreset and DataDriftPreset, plus parametrize cases
for pd.NA and np.int64 keys.

Fixes evidentlyai#1616.

Development

Successfully merging this pull request may close these issues.

Preset Failures with pandas NA class pd.NA (pydantic ValidationError)
