~5.7GB retained RSS loading an 18.45M-row table whose string column is near-unique (escapes the #906/#907 fixes) #924

Description

@paddymul

Problem

On 0.14.19, /load_expr for an 18.45M-row table with a near-unique string column (spree_id, ~18.45M distinct, ~12 chars) grew the server from 337MB to 6,044MB RSS (about +5.7GB), and it stays there after stats complete. The same server loaded a 23.9M-row table with repeating high-cardinality strings (plate, state, county) at +172MB, so the #906 approx_nunique and #907 histogram-sampling fixes hold for the repeating shape; the distinct≈rows shape escapes them. The likely cause is a path that still materializes per-distinct-value state: top-values/value_counts running before sampling kicks in, or the exact/approx dispatch treating the column as already small enough.

Repro: any ~18M-row frame with a per-row-unique string id column → POST /load_expr → compare RSS before/after; memory is not released afterwards.
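For the before/after RSS comparison, a minimal measurement helper could look like the sketch below. It assumes a Linux host where `/proc/<pid>/status` is readable; `parse_vmrss_mb`, `rss_mb`, and `rss_delta_mb` are hypothetical names for illustration, not part of buckaroo. The /load_expr request itself is left out because its payload is not described in this issue.

```python
import re
from pathlib import Path


def parse_vmrss_mb(status_text: str) -> float:
    """Extract VmRSS from the text of /proc/<pid>/status and return it in MB."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1)) / 1024


def rss_mb(pid: int) -> float:
    """Current resident set size of a process in MB (Linux only)."""
    return parse_vmrss_mb(Path(f"/proc/{pid}/status").read_text())


def rss_delta_mb(before_mb: float, after_mb: float) -> float:
    """Growth in RSS between two measurements, in MB."""
    return after_mb - before_mb
```

Calling `rss_mb(server_pid)` before the POST, again once stats complete, and once more after an idle wait separates transient growth from retained memory.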

Suggested fix

Cap distinct-dependent stats by distinct-count estimate, not just row count: a cheap approx_distinct probe before the batch can route distinct≈rows columns to the sampled/early-exit path. Separately, whatever holds the transient state after stats complete should release it (see companion issue on session retention).
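A rough sketch of the routing idea, in plain Python so it runs standalone. This is not buckaroo's code: `approx_distinct` here is a stand-in K-minimum-values estimator (a real implementation would more likely use the query engine's own approximate-distinct aggregate), and `choose_stats_path`, the 0.5 ratio threshold, and the `"sampled"`/`"full"` labels are assumptions for illustration.

```python
import hashlib
import heapq
from typing import Iterable, List, Set


def approx_distinct(values: Iterable[str], k: int = 1024) -> float:
    """Estimate the distinct count with a K-minimum-values sketch.

    Memory is bounded by k hashes regardless of how many rows are scanned.
    """
    heap: List[int] = []      # negated hashes: a max-heap of the k smallest
    members: Set[int] = set() # hashes currently held in the heap
    for value in values:
        digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest retained hash with the smaller new one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count is exact
    kth_smallest = max(-heap[0], 1)
    return (k - 1) * float(2 ** 64) / kth_smallest


def choose_stats_path(values: Iterable[str], row_count: int,
                      ratio_threshold: float = 0.5) -> str:
    """Route near-unique columns to the sampled path (hypothetical labels)."""
    if row_count == 0:
        return "full"
    ratio = approx_distinct(values) / row_count
    return "sampled" if ratio >= ratio_threshold else "full"
```

The point of the sketch is that the probe's memory stays constant while the decision depends on the distinct-to-row ratio, so a column like spree_id would be diverted before any per-distinct-value structure is built.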

Context

Found during tallyman prompt-pack run pack01 (2026-06-11), buckaroo 0.14.19, via tallyman's companion. Related: #906, #907 (closed), #911 (size-based exact/approx selection), #920 (perf/memory smoke testing — this column shape is a good test case).
