
Project summary_stats wire payload to only the stats the frontend reads #880

@paddymul

Summary

The all_stats payload sent to the frontend is built by sd_to_parquet_b64(merged_sd)
(serialization_utils.py:397, wired in at dataflow.py:698,702 and buckaroo_widget.py:364)
and serializes the entire merged_sd — every stat for every column. The frontend only
consumes a small subset:

  • histogram_bins / histogram_log_bins per column, for color-map binning (SDFT /
    SDFMeasure in DFWhole.ts), and
  • the pinned-row values actually rendered in the summary grid — dtype, mean, std,
    min, median, max, most_freq..5th_freq, non_null_count, null_count,
    unique_count, distinct_count (whatever the active StylingAnalysis.pinned_rows reference).

Everything else — value_counts (a whole pd.Series), histogram_args, memory_usage,
the is_* typing flags, the heuristic *_frac cleaning stats — is shipped to the browser and
never read.

Impact

Measured payload (parquet, then +33% as base64 on the wire):

frame                 all_stats parquet   as b64
5k rows × 20 cols     332 KB              443 KB
5k rows × 191 cols    3.2 MB              4.3 MB
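
The +33% figure follows from base64 mapping every 3 input bytes to 4 output characters. A minimal sketch of that arithmetic, assuming "332 KB" means 332,000 bytes (the unit convention is not stated here); b64_encoded_size is a hypothetical helper name, not something in the codebase:

```python
import base64
import os


def b64_encoded_size(n_bytes: int) -> int:
    """Length of the padded standard base64 encoding of n_bytes bytes."""
    # Every group of up to 3 input bytes becomes exactly 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


# Assumption: 332 KB is taken as 332,000 bytes, purely for illustration.
raw = os.urandom(332_000)
encoded = base64.b64encode(raw)

# The closed-form size matches what the stdlib encoder actually produces.
assert len(encoded) == b64_encoded_size(len(raw))
print(len(encoded), round(len(encoded) / len(raw), 3))
```

Under that assumption the encoded size is roughly 443 KB, in line with the first row of the measurements.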

Most of that is dead weight. Trimming to the displayed subset shrinks the initial_state /
traitlet payload, which matters for first paint because the hyparquet parse cost scales with
payload width. It also directly shrinks the persisted first-load cache bundles (#877).

Suggested fix

Derive the frontend-needed stat-key set from the active styling classes (pinned_rows + the
histogram-bin needs) and project merged_sd down to just those keys before sd_to_parquet_b64.
Keep the full merged_sd server-side / on the dataflow for styling regeneration and future
consumers; only the wire projection shrinks.
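
A rough sketch of what the projection step could look like. It assumes merged_sd is a plain dict mapping column name to a dict of stat name to value, and that each pinned-row config names its stat under a "primary_key_val" key. Both are assumptions about the data shapes, and displayed_stat_keys / project_summary_stats are hypothetical names, not existing functions:

```python
from typing import Any, Dict, Iterable, Set

# Assumption: merged_sd maps column name -> {stat name -> value}.
SummaryDict = Dict[str, Dict[str, Any]]

# Stats the frontend reads for color-map binning, per the summary above.
HISTOGRAM_KEYS = frozenset({"histogram_bins", "histogram_log_bins"})


def displayed_stat_keys(pinned_rows: Iterable[Dict[str, Any]]) -> Set[str]:
    """Collect the stat keys referenced by pinned rows, plus histogram bins.

    Assumption: each pinned-row config names its stat under "primary_key_val".
    """
    keys = {row["primary_key_val"] for row in pinned_rows}
    return keys | HISTOGRAM_KEYS


def project_summary_stats(merged_sd: SummaryDict, keep: Set[str]) -> SummaryDict:
    """Return a new dict with only the kept stats; merged_sd is not modified."""
    return {
        col: {stat: val for stat, val in stats.items() if stat in keep}
        for col, stats in merged_sd.items()
    }


# Illustrative data only; real pinned rows and stats carry more fields.
pinned_rows = [{"primary_key_val": "dtype"}, {"primary_key_val": "mean"}]
merged_sd = {
    "price": {
        "dtype": "float64",
        "mean": 3.5,
        "histogram_bins": [0.0, 2.5, 5.0],
        "value_counts": {3.5: 2},  # never read by the frontend
        "memory_usage": 1024,      # never read by the frontend
    }
}

wire_sd = project_summary_stats(merged_sd, displayed_stat_keys(pinned_rows))
print(sorted(wire_sd["price"]))  # ['dtype', 'histogram_bins', 'mean']
```

Because the projection builds a new dict, the full merged_sd stays available server-side. Any keys the serializer itself depends on would need to be added to the keep set.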

Scope

serialization_utils.py (a projection step before / inside sd_to_parquet_b64),
dataflow/styling_core.py + customizations/styling.py (source of the displayed-key set),
dataflow.py / buckaroo_widget.py (where all_stats is assembled). Widget + server share
the path.

Context

Identified while designing the initial-load cache (#877): the cache must persist all_stats, and
measuring it showed most of the payload is never read by the frontend.
