#908 and #909 move the expensive xorq stats to fast approximations. The desired end state is size-based: tables at or below ~100k rows get the accurate implementations (exact COUNT(DISTINCT), exact top-10 value counts), larger tables get the approximations.
#909 already dispatches this way because the per-column histogram phase receives length resolved from the batch aggregate. The batch phase itself cannot: batched stat expressions are built before any row count exists — length is computed by the same aggregate query they are folded into (xorq_stat_pipeline.py:374, __total_length__). So after #908, distinct_count is approximate at every table size.
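The ordering constraint can be illustrated with a toy, pure-Python sketch. This is not the xorq pipeline: `build_batch_exprs`, `run_batch`, `histogram_phase`, and `EXACT_THRESHOLD` are hypothetical names, rows are plain dicts, and a set-based count stands in for an approximate distinct count. Only `__total_length__` and the ~100k figure come from the issue.

```python
from collections import Counter

EXACT_THRESHOLD = 100_000  # hypothetical cutoff, taken from the "~100k rows" figure


def build_batch_exprs(columns):
    """Build every batched stat up front; no row count is available here."""
    exprs = {"__total_length__": lambda rows: len(rows)}
    for col in columns:
        # Stand-in for an approximate distinct count (assumption: the real
        # pipeline would emit an approximate aggregate expression instead).
        exprs[f"{col}__distinct_count"] = lambda rows, c=col: len({r[c] for r in rows})
    return exprs


def run_batch(exprs, rows):
    """One pass: length comes back together with every other batched stat."""
    return {name: fn(rows) for name, fn in exprs.items()}


def histogram_phase(rows, col, length):
    """Runs after the batch, so it can branch on the resolved length."""
    mode = "exact" if length <= EXACT_THRESHOLD else "approx"
    return mode, Counter(r[col] for r in rows).most_common(10)


rows = [{"a": 1}, {"a": 1}, {"a": 2}]
batch = run_batch(build_batch_exprs(["a"]), rows)
mode, top = histogram_phase(rows, "a", batch["__total_length__"])
```

In this sketch `build_batch_exprs` has nothing to branch on, while `histogram_phase` receives `length` as an argument, which mirrors why only the later phase can dispatch by size.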
Options
Pre-count before building the batch expressions. table.count() is metadata-cheap on plain parquet scans but is a full plan execution for joins and filter chains; the count could go through the same snapshot cache so it's paid once per expression.
Let @stat functions declare exact and approximate variants, and have the pipeline choose once length is known. The batch runs the approximate variant unconditionally; when length <= threshold, a follow-up pass re-runs the exact variants and updates the summary. That pass is cheap by definition at that size.
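Both options can be sketched in plain Python under loud assumptions. None of these names exist in xorq: `cached_count`, `StatVariants`, and `summarize` are hypothetical, a dict stands in for the snapshot cache, and the "approximate" distinct count is a deliberately crude every-other-value sample chosen only so the two variants give visibly different answers.

```python
EXACT_THRESHOLD = 100_000  # hypothetical cutoff from the "~100k rows" figure

_count_cache = {}  # hypothetical stand-in for the snapshot cache


def cached_count(expr_key, count_fn):
    """Option 1: pay for the row count once per expression key."""
    if expr_key not in _count_cache:
        _count_cache[expr_key] = count_fn()
    return _count_cache[expr_key]


class StatVariants:
    """Option 2: one stat that declares an approximate and an exact variant."""

    def __init__(self, name, approx, exact):
        self.name = name
        self.approx = approx
        self.exact = exact


def summarize(values, stats, threshold=EXACT_THRESHOLD):
    # Batch pass: length and the approximate variants are computed together,
    # so nothing in this pass can depend on length.
    summary = {"length": len(values)}
    for stat in stats:
        summary[stat.name] = stat.approx(values)
    # Follow-up pass: length is now known, so small inputs get exact values.
    if summary["length"] <= threshold:
        for stat in stats:
            summary[stat.name] = stat.exact(values)
    return summary


distinct_count = StatVariants(
    "distinct_count",
    approx=lambda vals: len(set(vals[::2])),  # crude sample, illustration only
    exact=lambda vals: len(set(vals)),
)

values = [1, 2, 3, 4, 1, 2]
small = summarize(values, [distinct_count])               # exact variant wins
large = summarize(values, [distinct_count], threshold=3)  # approx value is kept
```

The trade-off the sketch makes visible: option 1 adds a count before anything is built, while option 2 keeps a single batch query and adds a second pass only for inputs small enough that exact stats are cheap.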
Context
Split out of #906. Until a selection mechanism exists, the fast methods are the default wherever row count is unavailable at expression-build time.