Skip to content

refactor(dataflow): generic frame typing + basedpyright CI job, clear standard-mode backlog (80→46)#928

Open
paddymul wants to merge 9 commits into
mainfrom
refactor/dataflow-generic-typing
Open

refactor(dataflow): generic frame typing + basedpyright CI job, clear standard-mode backlog (80→46)#928
paddymul wants to merge 9 commits into
mainfrom
refactor/dataflow-generic-typing

Conversation

@paddymul

@paddymul paddymul commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

What

Makes the dataflow pipeline generic over its dataframe type, adds a non-blocking CI job that type-checks the affected files with basedpyright, and works the standard-mode backlog that job surfaced down from 80 to 46 errors.

Generic frame typing

New buckaroo/dataflow/df_types.py with a two-tier carrier scheme:

  • FrameT (unbounded) — umbrella for the abstract base, spanning the lazy ColumnExecutorDataflow (pl.LazyFrame) that can't meet the eager contract.
  • DataFrameT (bound to a structural DataFrameLike Protocol: columns / __len__ / __getitem__) — the eager pandas/polars/xorq bodies. No nominal base spans the three backends, so the bound is structural.

ABCDataflow becomes Generic[FrameT]; DataFlow / CustomizableDataflow become Generic[DataFrameT]. The frame type flows through the method boundaries that actually take/return a frame, so each backend reports processed_df in its own type instead of Any. Backends bind concretely (PandasDataflow, PolarsDataflow, XorqDataflow, the server dataflows, and ColumnExecutorDataflow(ABCDataflow[pl.LazyFrame])). processed_df is modelled as an abstract read-only property on the base so the @property overrides are a like-for-like override.

traitlets-backed attributes stay untyped on purpose (a descriptor returns Any on instance access). No runtime behaviour change — annotations plus a few documented casts.

Non-blocking basedpyright CI job

New TypeCheck job runs basedpyright over the nine touched files, scoped via pyrightconfig.typecheck.json. It uses standard mode (not basedpyright's stricter recommended default) so the signal is real type errors rather than third-party-stub noise, and it's continue-on-error so it never blocks merges. The config's name is not auto-discovered, so editors keep using the existing [tool.basedpyright] block. basedpyright is pinned at 1.39.8.

Typing backlog cleanup

With the job in place, the standard-mode diagnostics on these files went from 80 (the main baseline) to 46. Each step is its own commit and is runtime-neutral — the test suite is unchanged at every one:

  • processed_df as an abstract property + _summary_sd None guard. Closes the real Optional[DataFrameT] nullability gap the generics surfaced, and resolves the property-override modelling.
  • Drop the unused AnyDataFrame alias. Dead speculative API in df_types.py.
  • traitlets wire attributes → Any on ABCDataflow. The base declared summary_sd / merged_sd / df_meta / buckaroo_options / … as Dict[str, Any], which collided with the trait and @property definitions in subclasses (incompatible-override, illegal-assignment). A traitlets descriptor resolves to Any on access, so the base now says so. This one change also cleared 10 inherited errors in column_executor_dataflow.py.
  • DataFlow.analysis_klasses defaults to [] not None. Keeps the type List[Type[ColAnalysis]] end-to-end.
  • ac_obj typed via an Autocleaning Protocol. The bare SentinelAutocleaning default lacks the code-interpreter methods CustomizableDataflow calls; a structural Protocol plus a cast at construction documents that bridge.
  • Guard the exception reraise against a None exc_info. Small latent-bug fix: re-raise the current exception instead of subscripting None when no handler stored one.
  • Narrow the DfTrait trait name to str in set().

basedpyright results (standard mode, these files)

errors warnings
main baseline (8 existing files) 80 6
this branch (9 files, incl. new df_types.py) 46 6
  • df_types.py (the new typing core): 0 errors.
  • Net: −34 vs main. The generics surfaced a couple of real gaps (now closed); the bulk of the reduction is clearing the pre-existing standard-mode backlog the new job made visible (traitlets/wire-attr modelling, the autocleaning interface, nullability guards, the analysis_klasses / DfTrait types).

Test

pytest tests/unit -k "dataflow or polars or xorq or widget or column_executor" → 443 passed, 2 skipped, at every commit. No runtime change.

Remaining (follow-ups, not blocking)

The 46 diagnostics still on these files, none blocking:

  • reportPrivateImportUsage re-export noisetraitlets (Unicode / Any / observe / Dict) and xorq.vendor.*. Suppressible via a config key if re-export noise isn't worth tracking.
  • Frame arg-type gaps in dataflow.pyOptional[DataFrameT] into non-optional params (a couple may be real None-handling gaps), and DataFrameT into helpers typed for pd.DataFrame (pd_to_obj / serialize_sample / DFStatsClass), which want generic/Protocol signatures.
  • SDType / ErrDict TypedDict vs Dict[str, Any] mismatches in a couple of signatures.
  • The remaining xorq_buckaroo.py (10) / buckaroo_widget.py (9) errors haven't been profiled yet.
  • Fold the existing [tool.basedpyright.analysis.inlayHints] key (which the CLI rejects as unrecognized), and decide whether to grow the scoped file list over time.

🤖 Generated with Claude Code

… type

Introduce buckaroo/dataflow/df_types.py with a two-tier carrier scheme:
FrameT (unbounded) for the abstract base, and DataFrameT (bound to a
structural DataFrameLike Protocol: columns / __len__ / __getitem__) for the
eager pandas/polars/xorq bodies. There is no nominal base spanning the three
backends, so the bound is structural.

ABCDataflow becomes Generic[FrameT]; DataFlow / CustomizableDataflow become
Generic[DataFrameT]. The frame type now flows through the method boundaries
that actually take or return a frame (_compute_sampled_df,
_compute_processed_result, _get_summary_sd, _df_to_obj, cleaned_df,
processed_df), so each backend reports processed_df in its own type rather
than Any. Backends bind concretely: PandasDataflow = CustomizableDataflow[
pd.DataFrame], PolarsDataflow = CustomizableDataflow[pl.DataFrame],
XorqDataflow(CustomizableDataflow[XorqExpr]), the server dataflows, and the
lazy ColumnExecutorDataflow(ABCDataflow[pl.LazyFrame]).

traitlets-backed attributes stay untyped on purpose (a descriptor returns Any
on instance access); the generic contract is expressed on the methods. No
runtime behaviour change — annotations and a small number of documented casts
only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.15.0.dev27638553511

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.15.0.dev27638553511

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.15.0.dev27638553511" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

…ow typing

New TypeCheck job in checks.yml runs basedpyright over the nine files touched
by the generic-typing work. Two deliberate choices keep the signal usable:

- Scoped via pyrightconfig.typecheck.json, which lists exactly those files.
  Its name is not auto-discovered, so editors keep using the
  [tool.basedpyright] block in pyproject.toml.
- standard mode rather than basedpyright's stricter "recommended" default, so
  the output is real type errors rather than third-party-stub noise
  (reportUnknownMemberType / reportAny from pyarrow, xorq, traitlets).

continue-on-error: true makes it non-blocking. These files carry a large
pre-existing backlog of standard-mode diagnostics (~80, mostly
reportPrivateImportUsage from xorq.vendor and pre-existing override
mismatches), so the job tracks a baseline to drive down rather than gating
merges today. basedpyright is pinned (1.39.8) for reproducible diagnostics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@paddymul paddymul force-pushed the refactor/dataflow-generic-typing branch from b885d6b to ff7c620 Compare June 16, 2026 13:05
…y_sd None path

Address the two genuinely-useful diagnostics the generics surfaced (the
rest of the scoped basedpyright output is pre-existing backlog):

- ABCDataflow declared processed_df as a plain variable annotation, while
  DataFlow and ColumnExecutorDataflow override it with a read-only
  @Property. basedpyright flagged both overrides as incompatible
  ("overrides symbol of same name"). processed_df is a computed,
  read-only property in every implementation, so declare it that way on
  the base (abstract property) — a like-for-like override. No runtime
  assignment to processed_df exists, so the read-only contract is sound.

- _summary_sd passed self.processed_df (Optional[DataFrameT]) straight
  into _get_summary_sd, which requires a non-None DataFrameT. Add the
  same `if df is None: return` guard the sibling observers _merged_sd and
  _populate_sd_cache already have. In the normal eager cascade
  processed_df is non-None when this fires; the guard only affects the
  degenerate path, bringing the three observers into line.

Scoped basedpyright drops 84 -> 81 errors. Full unit suite unchanged
(1205 passed; the 3 pre-existing failures are build-artifact/network
tests unrelated to this change).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AnyDataFrame had no consumers — it was speculative API surface for a
backend-dispatch call site that doesn't exist yet (YAGNI). It was also
the only thing referencing pandas/polars/xorq in this module, so removing
it lets the now-dead imports (and the TYPE_CHECKING / TypeAlias bits) go
too, leaving df_types.py to express just the FrameT / DataFrameT /
DataFrameLike contract. No runtime or type-check change: df_types.py
stays at 0 basedpyright errors and the scoped total is unchanged at 81.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…BCDataflow

ABCDataflow declared its synced/computed wire attributes (summary_sd,
merged_sd, df_meta, buckaroo_options, command_config, df_data_dict,
df_display_args, operation_results, cleaned_sd, processed_sd) as concrete
Dict[str, Any]. But every concrete dataflow backs those names with a
traitlets trait (Dict(...) / Any(...)) or a @Property, and a traitlets
descriptor returns Any on instance access. The concrete base declaration
only fought the descriptor protocol: each subclass trait/property read as
an incompatible override (reportIncompatibleVariableOverride), the trait
value read as an illegal assignment (reportAssignmentType), and writes
like `self.summary_sd = ...` as illegal attribute assignments
(reportAttributeAccessIssue).

Declaring them Any on the base — what they actually resolve to — removes
all of those. Same rationale df_types.py already applies to the
frame-carrying methods: type where it's sound, not on the traits. Also
dropped the local `buckaroo_options: BuckarooOptions` annotation for the
same reason (the descriptor value can't satisfy the TypedDict; the
BuckarooOptions shape is kept as documentation and named in the comment),
which additionally clears the "Could not assign item in TypedDict" error
in populate_auto_clean_options.

basedpyright (scoped config, standard mode): 81 -> 57 errors. dataflow.py
37 -> 23; column_executor_dataflow.py 21 -> 11 (inherited cleanup);
df_types.py stays at 0; no new errors. No runtime change: the affected
suite (dataflow/polars/xorq/widget/column_executor) is 443 passed,
2 skipped, unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ABCDataflow declares analysis_klasses: List[Type[ColAnalysis]], but the
bare DataFlow stub defaulted it to None, which doesn't satisfy that type
(reportAssignmentType at dataflow.py:108). Default to [] instead so the
type holds end-to-end — CustomizableDataflow already narrows it to a real
list ([StylingAnalysis]).

Behavior-equivalent: nothing relies on a None sentinel. Within the
dataflow the attribute is only read via id()-based summary-stats cache
keys and the dead "foo"/"bar" string branches in _get_summary_sd (both
treat None and [] alike, since neither equals a str); the one external
truthiness check is a function parameter where None and [] are both
falsy. The bare-pipeline default is also never mutated in place.

basedpyright (scoped, standard mode): 57 -> 56; df_types.py stays 0; no
new errors. No runtime change: dataflow/polars/xorq/widget/column_executor
suite is 443 passed, 2 skipped, unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CustomizableDataflow calls config_dict / add_command / _run_df_interpreter
/ run_code_generator on self.ac_obj, but pyright infers ac_obj from the
bare DataFlow default `autocleaning_klass = SentinelAutocleaning`, which
implements only command_config + handle_ops_and_clean — so those four
reads were reportAttributeAccessIssue.

Add a structural Autocleaning Protocol describing the full interface the
pipeline calls, and cast ac_obj to it at construction. SentinelAutocleaning
can't satisfy the whole Protocol (a plain annotation would error on the
assignment), and the code-interpreter members are only reached via
CustomizableDataflow, which always runs on a real backend — so the cast
documents that bridge rather than papering ac_obj as Any.

basedpyright (scoped, standard mode): 56 -> 52; dataflow.py 22 -> 18;
df_types.py stays 0; no new errors. No runtime change: the
dataflow/polars/xorq/widget/column_executor suite is 443 passed, 2 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
paddymul and others added 2 commits June 16, 2026 14:14
DataFlow.__init__ reraises self.exception[0..2] when applying raw_df
fails, but self.exception starts as None and is only populated to
sys.exc_info() by the exception_protect handlers (dataflow_extras) when a
trait-observer cascade fails. pyright saw the bare `= None` and flagged
the three subscripts as reportOptionalSubscript.

Annotate self.exception as Optional[Tuple[Any, Any, Any]] and guard: if no
handler stored an exc_info, re-raise the current exception instead of
subscripting None (previously a confusing 'NoneType is not subscriptable'
would mask the real error). Normal path is unchanged — when exc_info is
present it's reraised exactly as before.

basedpyright (scoped, standard mode): 52 -> 49; dataflow.py 18 -> 15;
df_types.py stays 0. No behavior change on the populated path: the
dataflow/polars/xorq/widget/column_executor suite is 443 passed, 2 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DfTrait.set indexes obj._trait_values[self.name] and calls
_notify_trait(self.name, ...), but the traitlets stubs type
TraitType.name as Optional[str] (None until the metaclass binds the trait
to a class), so each use was reportArgumentType (str | None vs str).

Bind `name = cast(str, self.name)` once — it's always set by the time
set() runs on an instance — and use it for the lookups and notification.
Runtime no-op.

basedpyright (scoped, standard mode): 49 -> 46; dataflow.py 15 -> 12;
df_types.py stays 0. No behavior change: dataflow/polars/xorq/widget/
column_executor suite is 443 passed, 2 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@paddymul paddymul changed the title refactor(dataflow): generic frame typing + non-blocking basedpyright CI job refactor(dataflow): generic frame typing + basedpyright CI job, clear standard-mode backlog (80→46) Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant