refactor(dataflow): generic frame typing + basedpyright CI job, clear standard-mode backlog (80→46)#928
Open
paddymul wants to merge 9 commits into
Open
refactor(dataflow): generic frame typing + basedpyright CI job, clear standard-mode backlog (80→46)#928paddymul wants to merge 9 commits into
paddymul wants to merge 9 commits into
Conversation
… type Introduce buckaroo/dataflow/df_types.py with a two-tier carrier scheme: FrameT (unbounded) for the abstract base, and DataFrameT (bound to a structural DataFrameLike Protocol: columns / __len__ / __getitem__) for the eager pandas/polars/xorq bodies. There is no nominal base spanning the three backends, so the bound is structural. ABCDataflow becomes Generic[FrameT]; DataFlow / CustomizableDataflow become Generic[DataFrameT]. The frame type now flows through the method boundaries that actually take or return a frame (_compute_sampled_df, _compute_processed_result, _get_summary_sd, _df_to_obj, cleaned_df, processed_df), so each backend reports processed_df in its own type rather than Any. Backends bind concretely: PandasDataflow = CustomizableDataflow[ pd.DataFrame], PolarsDataflow = CustomizableDataflow[pl.DataFrame], XorqDataflow(CustomizableDataflow[XorqExpr]), the server dataflows, and the lazy ColumnExecutorDataflow(ABCDataflow[pl.LazyFrame]). traitlets-backed attributes stay untyped on purpose (a descriptor returns Any on instance access); the generic contract is expressed on the methods. No runtime behaviour change — annotations and a small number of documented casts only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
📦 TestPyPI package publishedpip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.15.0.dev27638553511or with uv: uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.15.0.dev27638553511MCP server for Claude Codeclaude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.15.0.dev27638553511" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table📖 Docs preview🎨 Storybook preview |
…ow typing New TypeCheck job in checks.yml runs basedpyright over the nine files touched by the generic-typing work. Two deliberate choices keep the signal usable: - Scoped via pyrightconfig.typecheck.json, which lists exactly those files. Its name is not auto-discovered, so editors keep using the [tool.basedpyright] block in pyproject.toml. - standard mode rather than basedpyright's stricter "recommended" default, so the output is real type errors rather than third-party-stub noise (reportUnknownMemberType / reportAny from pyarrow, xorq, traitlets). continue-on-error: true makes it non-blocking. These files carry a large pre-existing backlog of standard-mode diagnostics (~80, mostly reportPrivateImportUsage from xorq.vendor and pre-existing override mismatches), so the job tracks a baseline to drive down rather than gating merges today. basedpyright is pinned (1.39.8) for reproducible diagnostics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b885d6b to
ff7c620
Compare
…y_sd None path Address the two genuinely-useful diagnostics the generics surfaced (the rest of the scoped basedpyright output is pre-existing backlog): - ABCDataflow declared processed_df as a plain variable annotation, while DataFlow and ColumnExecutorDataflow override it with a read-only @Property. basedpyright flagged both overrides as incompatible ("overrides symbol of same name"). processed_df is a computed, read-only property in every implementation, so declare it that way on the base (abstract property) — a like-for-like override. No runtime assignment to processed_df exists, so the read-only contract is sound. - _summary_sd passed self.processed_df (Optional[DataFrameT]) straight into _get_summary_sd, which requires a non-None DataFrameT. Add the same `if df is None: return` guard the sibling observers _merged_sd and _populate_sd_cache already have. In the normal eager cascade processed_df is non-None when this fires; the guard only affects the degenerate path, bringing the three observers into line. Scoped basedpyright drops 84 -> 81 errors. Full unit suite unchanged (1205 passed; the 3 pre-existing failures are build-artifact/network tests unrelated to this change). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AnyDataFrame had no consumers — it was speculative API surface for a backend-dispatch call site that doesn't exist yet (YAGNI). It was also the only thing referencing pandas/polars/xorq in this module, so removing it lets the now-dead imports (and the TYPE_CHECKING / TypeAlias bits) go too, leaving df_types.py to express just the FrameT / DataFrameT / DataFrameLike contract. No runtime or type-check change: df_types.py stays at 0 basedpyright errors and the scoped total is unchanged at 81. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…BCDataflow ABCDataflow declared its synced/computed wire attributes (summary_sd, merged_sd, df_meta, buckaroo_options, command_config, df_data_dict, df_display_args, operation_results, cleaned_sd, processed_sd) as concrete Dict[str, Any]. But every concrete dataflow backs those names with a traitlets trait (Dict(...) / Any(...)) or a @Property, and a traitlets descriptor returns Any on instance access. The concrete base declaration only fought the descriptor protocol: each subclass trait/property read as an incompatible override (reportIncompatibleVariableOverride), the trait value read as an illegal assignment (reportAssignmentType), and writes like `self.summary_sd = ...` as illegal attribute assignments (reportAttributeAccessIssue). Declaring them Any on the base — what they actually resolve to — removes all of those. Same rationale df_types.py already applies to the frame-carrying methods: type where it's sound, not on the traits. Also dropped the local `buckaroo_options: BuckarooOptions` annotation for the same reason (the descriptor value can't satisfy the TypedDict; the BuckarooOptions shape is kept as documentation and named in the comment), which additionally clears the "Could not assign item in TypedDict" error in populate_auto_clean_options. basedpyright (scoped config, standard mode): 81 -> 57 errors. dataflow.py 37 -> 23; column_executor_dataflow.py 21 -> 11 (inherited cleanup); df_types.py stays at 0; no new errors. No runtime change: the affected suite (dataflow/polars/xorq/widget/column_executor) is 443 passed, 2 skipped, unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ABCDataflow declares analysis_klasses: List[Type[ColAnalysis]], but the bare DataFlow stub defaulted it to None, which doesn't satisfy that type (reportAssignmentType at dataflow.py:108). Default to [] instead so the type holds end-to-end — CustomizableDataflow already narrows it to a real list ([StylingAnalysis]). Behavior-equivalent: nothing relies on a None sentinel. Within the dataflow the attribute is only read via id()-based summary-stats cache keys and the dead "foo"/"bar" string branches in _get_summary_sd (both treat None and [] alike, since neither equals a str); the one external truthiness check is a function parameter where None and [] are both falsy. The bare-pipeline default is also never mutated in place. basedpyright (scoped, standard mode): 57 -> 56; df_types.py stays 0; no new errors. No runtime change: dataflow/polars/xorq/widget/column_executor suite is 443 passed, 2 skipped, unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CustomizableDataflow calls config_dict / add_command / _run_df_interpreter / run_code_generator on self.ac_obj, but pyright infers ac_obj from the bare DataFlow default `autocleaning_klass = SentinelAutocleaning`, which implements only command_config + handle_ops_and_clean — so those four reads were reportAttributeAccessIssue. Add a structural Autocleaning Protocol describing the full interface the pipeline calls, and cast ac_obj to it at construction. SentinelAutocleaning can't satisfy the whole Protocol (a plain annotation would error on the assignment), and the code-interpreter members are only reached via CustomizableDataflow, which always runs on a real backend — so the cast documents that bridge rather than papering ac_obj as Any. basedpyright (scoped, standard mode): 56 -> 52; dataflow.py 22 -> 18; df_types.py stays 0; no new errors. No runtime change: the dataflow/polars/xorq/widget/column_executor suite is 443 passed, 2 skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DataFlow.__init__ reraises self.exception[0..2] when applying raw_df fails, but self.exception starts as None and is only populated to sys.exc_info() by the exception_protect handlers (dataflow_extras) when a trait-observer cascade fails. pyright saw the bare `= None` and flagged the three subscripts as reportOptionalSubscript. Annotate self.exception as Optional[Tuple[Any, Any, Any]] and guard: if no handler stored an exc_info, re-raise the current exception instead of subscripting None (previously a confusing 'NoneType is not subscriptable' would mask the real error). Normal path is unchanged — when exc_info is present it's reraised exactly as before. basedpyright (scoped, standard mode): 52 -> 49; dataflow.py 18 -> 15; df_types.py stays 0. No behavior change on the populated path: the dataflow/polars/xorq/widget/column_executor suite is 443 passed, 2 skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DfTrait.set indexes obj._trait_values[self.name] and calls _notify_trait(self.name, ...), but the traitlets stubs type TraitType.name as Optional[str] (None until the metaclass binds the trait to a class), so each use was reportArgumentType (str | None vs str). Bind `name = cast(str, self.name)` once — it's always set by the time set() runs on an instance — and use it for the lookups and notification. Runtime no-op. basedpyright (scoped, standard mode): 49 -> 46; dataflow.py 15 -> 12; df_types.py stays 0. No behavior change: dataflow/polars/xorq/widget/ column_executor suite is 443 passed, 2 skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Makes the dataflow pipeline generic over its dataframe type, adds a non-blocking CI job that type-checks the affected files with basedpyright, and works the standard-mode backlog that job surfaced down from 80 to 46 errors.
Generic frame typing
New
buckaroo/dataflow/df_types.pywith a two-tier carrier scheme:FrameT(unbounded) — umbrella for the abstract base, spanning the lazyColumnExecutorDataflow(pl.LazyFrame) that can't meet the eager contract.DataFrameT(bound to a structuralDataFrameLikeProtocol:columns/__len__/__getitem__) — the eager pandas/polars/xorq bodies. No nominal base spans the three backends, so the bound is structural.ABCDataflowbecomesGeneric[FrameT];DataFlow/CustomizableDataflowbecomeGeneric[DataFrameT]. The frame type flows through the method boundaries that actually take/return a frame, so each backend reportsprocessed_dfin its own type instead ofAny. Backends bind concretely (PandasDataflow,PolarsDataflow,XorqDataflow, the server dataflows, andColumnExecutorDataflow(ABCDataflow[pl.LazyFrame])).processed_dfis modelled as an abstract read-only property on the base so the@propertyoverrides are a like-for-like override.traitlets-backed attributes stay untyped on purpose (a descriptor returns
Anyon instance access). No runtime behaviour change — annotations plus a few documentedcasts.Non-blocking basedpyright CI job
New
TypeCheckjob runs basedpyright over the nine touched files, scoped viapyrightconfig.typecheck.json. It uses standard mode (not basedpyright's stricterrecommendeddefault) so the signal is real type errors rather than third-party-stub noise, and it'scontinue-on-errorso it never blocks merges. The config's name is not auto-discovered, so editors keep using the existing[tool.basedpyright]block. basedpyright is pinned at 1.39.8.Typing backlog cleanup
With the job in place, the standard-mode diagnostics on these files went from 80 (the
mainbaseline) to 46. Each step is its own commit and is runtime-neutral — the test suite is unchanged at every one:processed_dfas an abstract property +_summary_sdNone guard. Closes the realOptional[DataFrameT]nullability gap the generics surfaced, and resolves the property-override modelling.AnyDataFramealias. Dead speculative API indf_types.py.AnyonABCDataflow. The base declaredsummary_sd/merged_sd/df_meta/buckaroo_options/ … asDict[str, Any], which collided with the trait and@propertydefinitions in subclasses (incompatible-override, illegal-assignment). A traitlets descriptor resolves toAnyon access, so the base now says so. This one change also cleared 10 inherited errors incolumn_executor_dataflow.py.DataFlow.analysis_klassesdefaults to[]notNone. Keeps the typeList[Type[ColAnalysis]]end-to-end.ac_objtyped via anAutocleaningProtocol. The bareSentinelAutocleaningdefault lacks the code-interpreter methodsCustomizableDataflowcalls; a structural Protocol plus a cast at construction documents that bridge.Noneexc_info. Small latent-bug fix: re-raise the current exception instead of subscriptingNonewhen no handler stored one.DfTraittrait name tostrinset().basedpyright results (standard mode, these files)
mainbaseline (8 existing files)df_types.py)df_types.py(the new typing core): 0 errors.main. The generics surfaced a couple of real gaps (now closed); the bulk of the reduction is clearing the pre-existing standard-mode backlog the new job made visible (traitlets/wire-attr modelling, the autocleaning interface, nullability guards, theanalysis_klasses/DfTraittypes).Test
pytest tests/unit -k "dataflow or polars or xorq or widget or column_executor"→ 443 passed, 2 skipped, at every commit. No runtime change.Remaining (follow-ups, not blocking)
The 46 diagnostics still on these files, none blocking:
reportPrivateImportUsagere-export noise —traitlets(Unicode/Any/observe/Dict) andxorq.vendor.*. Suppressible via a config key if re-export noise isn't worth tracking.dataflow.py—Optional[DataFrameT]into non-optional params (a couple may be realNone-handling gaps), andDataFrameTinto helpers typed forpd.DataFrame(pd_to_obj/serialize_sample/DFStatsClass), which want generic/Protocol signatures.SDType/ErrDictTypedDict vsDict[str, Any]mismatches in a couple of signatures.xorq_buckaroo.py(10) /buckaroo_widget.py(9) errors haven't been profiled yet.[tool.basedpyright.analysis.inlayHints]key (which the CLI rejects as unrecognized), and decide whether to grow the scoped file list over time.🤖 Generated with Claude Code