Summary
The all_stats payload sent to the frontend is built by sd_to_parquet_b64(merged_sd)
(serialization_utils.py:397, wired in at dataflow.py:698,702 and buckaroo_widget.py:364)
and serializes the entire merged_sd — every stat for every column. The frontend only
consumes a small subset:
- histogram_bins / histogram_log_bins per column, for color-map binning (SDFT /
  SDFMeasure in DFWhole.ts), and
- the pinned-row values actually rendered in the summary grid: dtype, mean, std,
  min, median, max, most_freq..5th_freq, non_null_count, null_count,
  unique_count, distinct_count (whatever the active StylingAnalysis.pinned_rows reference).
Everything else — value_counts (a whole pd.Series), histogram_args, memory_usage,
the is_* typing flags, the heuristic *_frac cleaning stats — is shipped to the browser and
never read.
Impact
Measured payload (parquet, then +33% as base64 on the wire):

| frame | all_stats parquet | as b64 |
| --- | --- | --- |
| 5k rows × 20 cols | 332 KB | 443 KB |
| 5k rows × 191 cols | 3.2 MB | 4.3 MB |

Most of that is dead weight. Trimming to the displayed subset shrinks the initial_state /
traitlet payload (the parquet parse in hyparquet scales with payload width and is a real
first-paint cost). It also directly shrinks persisted first-load cache bundles (#877).
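
The +33% figure follows from base64 emitting 4 output characters for every 3 input bytes. A quick check against the first measured row, treating 332 KB as 332,000 bytes (an assumption about the units; b64_size is a hypothetical helper, not part of the codebase):

```python
import base64
import math


def b64_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


raw_size = 332_000  # assumed byte count for the 5k rows x 20 cols frame
encoded_size = b64_size(raw_size)

# Cross-check the formula against the standard-library encoder.
assert encoded_size == len(base64.b64encode(bytes(raw_size)))

print(encoded_size, round(encoded_size / raw_size, 3))
```

Any byte trimmed from the parquet payload therefore saves about 1.33 bytes on the wire.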
Suggested fix
Derive the frontend-needed stat-key set from the active styling classes (pinned_rows + the
histogram-bin needs) and project merged_sd down to just those keys before sd_to_parquet_b64.
Keep the full merged_sd server-side / on the dataflow for styling regeneration and future
consumers; only the wire projection shrinks.
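
A minimal sketch of that projection, under some assumptions: merged_sd is a dict of column name to a dict of stat name to value, and each pinned_rows entry is a dict that names its stat under primary_key_val. The helper names displayed_stat_keys and project_sd are hypothetical, and the real styling classes may expose the key set differently.

```python
from typing import Any, Dict, Iterable, Mapping, Set

# Assumption: the histogram stats the frontend reads for color-map binning.
HISTOGRAM_KEYS = frozenset({"histogram_bins", "histogram_log_bins"})


def displayed_stat_keys(
    pinned_rows: Iterable[Mapping[str, Any]],
    extra_keys: Iterable[str] = HISTOGRAM_KEYS,
) -> Set[str]:
    """Collect the stat keys the frontend needs: pinned rows plus histogram bins."""
    keys = set(extra_keys)
    for row in pinned_rows:
        key = row.get("primary_key_val")  # assumed shape of a pinned_rows entry
        if key is not None:
            keys.add(key)
    return keys


def project_sd(
    merged_sd: Mapping[str, Mapping[str, Any]],
    keep_keys: Set[str],
) -> Dict[str, Dict[str, Any]]:
    """Return a new summary dict holding only keep_keys; merged_sd is not mutated."""
    return {
        col: {k: v for k, v in stats.items() if k in keep_keys}
        for col, stats in merged_sd.items()
    }


# Usage with a toy merged_sd (values are illustrative, not measured).
merged_sd = {
    "a": {
        "dtype": "int64",
        "mean": 2.0,
        "histogram_bins": [0, 1, 2],
        "value_counts": [3, 2, 1],
        "memory_usage": 128,
    }
}
pinned_rows = [{"primary_key_val": "dtype"}, {"primary_key_val": "mean"}]

wire_sd = project_sd(merged_sd, displayed_stat_keys(pinned_rows))
# wire_sd is what would be handed to sd_to_parquet_b64; merged_sd stays intact.
```

Building a new dict, instead of deleting keys in place, keeps the full merged_sd available server-side as the fix requires.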
Scope
serialization_utils.py (a projection step before / inside sd_to_parquet_b64),
dataflow/styling_core.py + customizations/styling.py (source of the displayed-key set),
dataflow.py / buckaroo_widget.py (where all_stats is assembled). Widget + server share
the path.
Context
Identified while designing the initial-load cache (#877): the cache must persist all_stats, and
measuring it showed most of the payload is never read by the frontend.