Fix/1616 pd na in by label count by ugbotueferhire · Pull Request #1880 · evidentlyai/evidently

ugbotueferhire · 2026-05-14T10:15:05Z

Description

DataSummaryPreset and DataDriftPreset crashed with a pydantic.ValidationError whenever a categorical column contained pd.NA — pandas' native missing-value sentinel for nullable dtypes (string, Int64, boolean, etc.). Users had to convert pd.NA to None before passing data in, defeating the point of running exploratory metrics on raw data.

This PR makes both presets handle pd.NA transparently. Missing values are treated as missing: they are counted by MissingValueCount and never surfaced as a phantom <NA> label in ByLabelCountValue. The fix consists of three small, layered changes: one addresses the direct cause, and two are defensive guards so that the same class of bug cannot quietly resurface in future ByLabelCount metrics. There are no public API changes and no change to the Label type.

import pandas as pd
from evidently import DataDefinition, Dataset, Report
from evidently.presets import DataSummaryPreset

df = pd.DataFrame({"a": pd.Series(["x", "y", "z", pd.NA], dtype="string")})
ds = Dataset.from_pandas(df, DataDefinition(categorical_columns=["a"]))
Report([DataSummaryPreset()]).run(current_data=ds)
# Before: pydantic.ValidationError on ByLabelCountValue (counts/shares __key__)
# After:  runs cleanly; pd.NA is treated as missing, not as a label.

The same failure hit DataDriftPreset. Both presets now pass.

Root cause

Three cooperating defects:

- UniqueValueCountCalculation._all_unique_values used pd.Series.unique(), which preserves pd.NA / np.nan. The NA sentinel was then carried into the result dict as a label key.
- ByLabelCountValue lacked the @validator(..., pre=True) key-coercion that its sibling ByLabelValue has, so non-Label-typed dict keys (e.g. np.int64, pd.NA) reached pydantic uncoerced.
- convert_types relied on np.isnan(val); evaluating that check on pd.NA raises TypeError: boolean value of NA is ambiguous.
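
The first and third defects can be reproduced with pandas and NumPy alone. The snippet below is a minimal sketch, not code from evidently: the variable names are illustrative, and it only shows the library behavior the defects depend on.

```python
import numpy as np
import pandas as pd

s = pd.Series(["x", "y", pd.NA], dtype="string")

# Defect 1: unique() keeps the missing-value sentinel as if it were a value.
labels_with_na = list(s.unique())
has_na_label = any(v is pd.NA for v in labels_with_na)

# Dropping missing values first leaves only real labels, which matches
# the index that value_counts(dropna=True) reports.
labels = list(s.dropna().unique())
counted = list(s.value_counts(dropna=True).index)

# Defect 3: using the result of np.isnan as a condition fails for pd.NA.
try:
    if np.isnan(pd.NA):
        pass
    isnan_error = None
except TypeError as exc:
    isnan_error = str(exc)  # the PR reports "boolean value of NA is ambiguous"
```

Here `has_na_label` is true while `labels` and `counted` agree, which is the mismatch between the label set and the count set described above.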

Changes

- src/evidently/metrics/column_statistics.py: UniqueValueCountCalculation._all_unique_values now calls .dropna().unique(), aligning the unique-value set with the existing value_counts(dropna=True) on the count side. Missing values continue to be reported separately by MissingValueCount (no double-counting).
- src/evidently/core/metric_types.py:
  - Add @validator("counts", "shares", pre=True) on ByLabelCountValue, mirroring ByLabelValue.convert_labels.
  - Identity-check pd.NA in convert_types and normalize it to None, consistent with Label = Union[StrictBool, int, str, None]. np.nan behavior is preserved (it still falls through to pydantic's str-coercion path, matching the existing test contract).
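
The key-normalization idea behind the two metric_types.py changes can be sketched without pydantic. This is an illustration under assumptions, not evidently's implementation: `normalize_label` and `coerce_label_keys` are hypothetical names, and the real convert_types handles more cases (including the np.nan path mentioned above).

```python
import numpy as np
import pandas as pd


def normalize_label(val):
    """Hypothetical helper: map a raw dict key to a plain Python label."""
    # Identity check first: pd.NA is a singleton, and `is` never evaluates
    # its truth value, so this comparison cannot raise.
    if val is pd.NA:
        return None
    # NumPy scalars (np.int64, np.bool_, ...) unwrap to Python types.
    if isinstance(val, np.generic):
        return val.item()
    return val


def coerce_label_keys(mapping):
    """Hypothetical pre-validation step for a counts or shares dict."""
    return {normalize_label(k): v for k, v in mapping.items()}


counts = coerce_label_keys({np.int64(1): 3, "x": 2, pd.NA: 1})
```

After coercion every key is an `int`, a `str`, or `None`, so each one fits the Label union before pydantic validates the model. With the column_statistics.py change in place, pd.NA should no longer reach this step for the two presets; the guard exists for other callers.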

Tests

- New tests/future/presets/test_dataset_stats_na.py: end-to-end coverage matching the issue's exact reproduction on both DataSummaryPreset and DataDriftPreset.
- Extended tests/future/metrics/test_unique_value_count.py: cases for pd.NA in string / object / Int64 dtypes.
- Extended tests/future/test_metric_types.py: parametrize cases covering pd.NA and np.int64 keys on ByLabelCountValue.
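
As a rough illustration of what the dtype cases check, the sketch below runs the same property over the three dtypes using plain assertions. It is not the PR's test code: `unique_labels` is a hypothetical stand-in for the fixed label collection, and the actual tests exercise the evidently metrics themselves.

```python
import pandas as pd

# Hypothetical cases: three dtypes that can each hold pd.NA.
cases = {
    "string": pd.Series(["a", "b", pd.NA], dtype="string"),
    "object": pd.Series(["a", "b", pd.NA], dtype="object"),
    "Int64": pd.Series([1, 2, pd.NA], dtype="Int64"),
}


def unique_labels(series):
    """Stand-in for the fixed behavior: drop missing values, then dedupe."""
    return list(series.dropna().unique())


results = {name: unique_labels(s) for name, s in cases.items()}

for name, labels in results.items():
    # No label may be a missing-value sentinel, whatever the dtype.
    assert not any(pd.isna(v) for v in labels), name
    assert len(labels) == 2, name
```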

UniqueValueCountCalculation._all_unique_values used pd.Series.unique(), which preserves pd.NA / np.nan / None. The seed dict then carried the missing-value sentinel as a label key, which pydantic rejected against Label = Union[StrictBool, int, str, None].

value_counts(dropna=True) already drops NA on the count side, so this aligns the label set with the count set. Missing values continue to be reported separately by MissingValueCount.

Fixes evidentlyai#1616 for DataSummaryPreset / DataDriftPreset on categorical columns containing pd.NA.

ByLabelCountValue lacked the pre=True key-coercion validator that its sibling ByLabelValue has, so np.int64, pd.NA, and other non-Label-typed keys reached pydantic uncoerced. Add @validator("counts", "shares", pre=True), mirroring ByLabelValue.convert_labels.

convert_types relied on np.isnan(val); evaluating that check on pd.NA raises TypeError ("boolean value of NA is ambiguous"). Add an identity check for pd.NA that normalizes it to None, consistent with Label = Union[..., None]. np.nan behavior is preserved (it still falls through to pydantic's str-coercion path, matching the pre-existing test contract).

Adds end-to-end coverage matching the exact reproduction from evidentlyai#1616 on both DataSummaryPreset and DataDriftPreset, plus parametrize cases for pd.NA and np.int64 keys.

Fixes evidentlyai#1616.

ugbotueferhire added 2 commits May 14, 2026 10:44

ugbotueferhire wants to merge 2 commits into evidentlyai:main from ugbotueferhire:fix/1616-pd-na-in-by-label-count
