choose exact vs approximate summary stats by row count (exact at or below ~100k rows) #911

Description

@paddymul

Summary

#908 and #909 move the expensive xorq stats to fast approximations. The desired end state is size-based: tables at or below ~100k rows get the accurate implementations (exact COUNT(DISTINCT), exact top-10 value counts), larger tables get the approximations.

#909 already dispatches this way because the per-column histogram phase receives length resolved from the batch aggregate. The batch phase itself cannot: batched stat expressions are built before any row count exists — length is computed by the same aggregate query they are folded into (xorq_stat_pipeline.py:374, __total_length__). So after #908, distinct_count is approximate at every table size.

Options

  • Pre-count before building the batch expressions. table.count() is metadata-cheap on plain parquet scans but is a full plan execution for joins and filter chains; the count could go through the same snapshot cache so it's paid once per expression.
  • Let @stat functions declare exact/approx variants and have the pipeline pick once length is known: the batch runs the approx variant unconditionally, and a follow-up pass re-runs the exact variants when length <= threshold and updates the summary — cheap by definition at that size.
  • Leave it: at <=100k rows HLL's ~1% error is rarely visible in a stats panel, and the histogram (the user-visible stat) is already exact below the threshold via #909 (fix(stats): sample the categorical histogram input above 100k rows).
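
A minimal sketch of the second option (approx in the batch, exact in a follow-up pass), written in plain Python over in-memory lists rather than xorq expressions. Everything named here (`StatVariants`, `STATS`, `summarize`, `EXACT_THRESHOLD`) is hypothetical and not part of the existing pipeline, and `approx_distinct` is only a stand-in for an HLL-style estimate, not a real estimator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence

# "~100k rows" comes from the issue; the precise cutoff is an assumption.
EXACT_THRESHOLD = 100_000


@dataclass(frozen=True)
class StatVariants:
    """Hypothetical pair of implementations declared for one summary stat."""
    approx: Callable[[Sequence], object]
    exact: Optional[Callable[[Sequence], object]] = None


def approx_distinct(values: Sequence, sample_size: int = 1000) -> int:
    # Stand-in for an HLL-style estimate: counts distinct values in a prefix only.
    return len(set(values[:sample_size]))


def exact_distinct(values: Sequence) -> int:
    return len(set(values))


STATS: Dict[str, StatVariants] = {
    "distinct_count": StatVariants(approx=approx_distinct, exact=exact_distinct),
}


def summarize(column: Sequence, threshold: int = EXACT_THRESHOLD) -> dict:
    # Pass 1 (the batch): length is produced by this same pass, so every
    # stat has to run its approx variant unconditionally.
    summary = {"length": len(column)}
    for name, variants in STATS.items():
        summary[name] = variants.approx(column)

    # Pass 2 (follow-up): length is now known, so small tables re-run the
    # exact variants and overwrite the approximate values.
    if summary["length"] <= threshold:
        for name, variants in STATS.items():
            if variants.exact is not None:
                summary[name] = variants.exact(column)
    return summary
```

With this shape, stats that declare no `exact` variant keep their batch value, and the extra pass runs only on tables small enough for it to be cheap.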

Context

Split out of #906. Until a selection mechanism exists, the fast methods are the default wherever row count is unavailable at expression-build time.
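
The fallback rule above could be written as a small selection function. This is only a sketch: `choose_variant` and `EXACT_THRESHOLD` are hypothetical names, and treating exactly 100,000 rows as "exact" follows the "at or below" wording of the title.

```python
from typing import Optional

# "~100k rows" comes from the issue; the precise cutoff is an assumption.
EXACT_THRESHOLD = 100_000


def choose_variant(length: Optional[int], threshold: int = EXACT_THRESHOLD) -> str:
    """Return which implementation a stat should use for a table of this size."""
    if length is None:
        # Row count unavailable at expression-build time: use the fast method.
        return "approx"
    return "exact" if length <= threshold else "approx"
```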
