fix(codegen): materialize Scalar tuple-return elements in orchestration (#631)#1625
fix(codegen): materialize Scalar tuple-return elements in orchestration (#631)#1625Little-oil wants to merge 8 commits into
Conversation
…on (hw-native-sys#631) When an incore kernel returns a tuple mixing Tensor and Scalar elements, the orchestration codegen mapped each return position to an Out/InOut param. A Scalar element (e.g. an index `o0 = ob * 32` computed from an input param) is not backed by an Out param, so it resolved to no param index and was dropped — the generated C++ then referenced an undefined variable (or, after a stop-gap, a wrong `o0 = 0`), breaking compilation or producing incorrect runtime values. Re-materialize such Scalar elements at the call site: trace the callee's ReturnStmt value at that position, inline its local scalar definitions, substitute each callee param with the expression actually passed for it, and render via the existing GenerateExprString path (e.g. `int64_t o0 = (ob * 32);`). When the element cannot be traced to a pure arithmetic function of the call args, fall back to the prior 0 default so generated code stays compilable. Applied to both the plain tuple-return and pl.submit return-alias paths.
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds support for rematerializing scalar tuple elements from callee-defined expressions instead of unconditionally emitting ChangesScalar Tuple Return Rematerialization
Sequence Diagram(s)sequenceDiagram
participant TryRenderCalleeScalarReturn
participant CalleeFinder
participant DefnCollector
participant ParamMapper
participant Inliner
participant ExprRenderer
TryRenderCalleeScalarReturn->>CalleeFinder: locate callee ReturnStmt
TryRenderCalleeScalarReturn->>DefnCollector: collect scalar AssignStmts in callee
TryRenderCalleeScalarReturn->>ParamMapper: map callee params to call->args_
TryRenderCalleeScalarReturn->>Inliner: recursively inline local defs<br/>(with cycle guard)
Inliner->>Inliner: substitute params in expr
TryRenderCalleeScalarReturn->>ExprRenderer: GenerateExprString(resolved expr)
ExprRenderer-->>TryRenderCalleeScalarReturn: C++ expression or NotImplementedError
TryRenderCalleeScalarReturn-->>TryRenderCalleeScalarReturn: return expression or nullopt
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the TryRenderCalleeScalarReturn method in the orchestration codegen to inline and re-materialize scalar return-tuple elements at the call site based on the callee's defining expression and call-site arguments, resolving issue #631. It also adds a unit test to verify this behavior. The reviewer suggests that the order-insensitive collection of variable definitions in ScalarDefCollector can be incorrect and recommends using a prefix-based approach that accumulates definitions as statements are scanned to handle variable availability and dominance correctly.
| void VisitStmt_(const AssignStmtPtr& assign) override { | ||
| if (assign->var_) var_def.emplace(assign->var_.get(), assign->value_); | ||
| IRVisitor::VisitStmt_(assign); | ||
| } |
There was a problem hiding this comment.
When performing variable definition analysis, an order-insensitive collection of all definitions can be incorrect. Instead of using a simple map emplace/insert which doesn't respect execution flow, use a prefix-based approach that accumulates definitions as statements are scanned to correctly handle variable availability and dominance.
void VisitStmt_(const AssignStmtPtr& assign) override {
if (assign->var_) {
accumulated_defs.push_back({assign->var_.get(), assign->value_});
}
IRVisitor::VisitStmt_(assign);
}References
- When performing variable definition analysis, an order-insensitive collection of all definitions can be incorrect. Instead, use a prefix-based approach that accumulates definitions as statements are scanned to correctly handle variable availability and dominance.
…hw-native-sys#631) The Scalar tuple-return materialization renders the resolved return expression via GenerateExprString. That path only handles a fixed set of scalar kinds (consts, arithmetic, min/max, comparisons, cast, tuple-get); a resolved expression containing an unsupported kind — e.g. a Call to a scalar-returning helper — makes it throw NotImplementedError, which aborted orchestration codegen for kernels that compiled fine before (seen on DeepSeek-V4 decode_attention_swa: "GenerateExprString not implemented for expression type: Call"). Catch NotImplementedError in TryRenderCalleeScalarReturn and fall back to std::nullopt, so the caller emits the prior `0` default instead of aborting. Materialization is thus best-effort: it fires for pure arithmetic-of-args scalars and is a no-op (old behavior) for anything GenerateExprString cannot render.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/ut/codegen/test_orchestration_codegen.py (1)
775-838: ⚡ Quick winAdd a companion regression for the
pl.submitreturn-alias path.This test validates the plain tuple-return rematerialization path well, but the PR also updates the
pl.submitreturn-alias path. Please add a focusedpl.submitvariant (tuple[Tensor, Scalar]) to lock coverage for both touched paths.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/ut/codegen/test_orchestration_codegen.py` around lines 775 - 838, Add a companion test that mirrors test_scalar_tuple_return_materialized but exercises the pl.submit return-alias path: create a new test function (e.g., test_scalar_tuple_return_materialized_submit) that defines the same ScalarTupleReturnProgram (or reuse producer/consumer) and in the Orchestration function call the producer via pl.submit (e.g., t, off = pl.submit(self.producer, a, base, mid) or the framework's submit syntax) so the codegen path for pl.submit is exercised; generate code with _generate_orch_code and assert the same expectations as the original test (presence of "int64_t off = (base * 32);", absence of "int64_t off = 0;", "const Tensor& t = ext_mid;", and "add_scalar(off)") to lock coverage for the submit return-alias variant.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/ut/codegen/test_orchestration_codegen.py`:
- Around line 775-838: Add a companion test that mirrors
test_scalar_tuple_return_materialized but exercises the pl.submit return-alias
path: create a new test function (e.g.,
test_scalar_tuple_return_materialized_submit) that defines the same
ScalarTupleReturnProgram (or reuse producer/consumer) and in the Orchestration
function call the producer via pl.submit (e.g., t, off =
pl.submit(self.producer, a, base, mid) or the framework's submit syntax) so the
codegen path for pl.submit is exercised; generate code with _generate_orch_code
and assert the same expectations as the original test (presence of "int64_t off
= (base * 32);", absence of "int64_t off = 0;", "const Tensor& t = ext_mid;",
and "add_scalar(off)") to lock coverage for the submit return-alias variant.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 425042b0-4ea6-402c-b5e2-90a27d2bcd3b
📒 Files selected for processing (2)
src/codegen/orchestration/orchestration_codegen.cpptests/ut/codegen/test_orchestration_codegen.py
…s#631) The hw-native-sys#631 fix re-materializes Scalar tuple-return elements in both the plain tuple-return path (GenerateTupleReturnAliases) and the pl.submit path (GenerateSubmitReturnAliases), but only the plain path had a regression test. Add test_scalar_tuple_return_materialized_submit, a focused pl.submit variant returning tuple[Tensor, Scalar] inside a pl.manual_scope, asserting the scalar is inlined to "int64_t off = (base * 32);" (no "off = 0" stop-gap) and forwarded downstream via add_scalar(off).
…ative-sys#631)" This reverts commit 137328914 (the added test_scalar_tuple_return_materialized_submit). The test mis-unpacked pl.submit: it used `t, off, _tid = pl.submit(...)`, but the parser requires exactly 2 targets (result, task_id) — the callee's tuple return is the single `result`, not inline-unpackable — so the test raised ParserSyntaxError and failed unit-tests. There is no proven tuple-returning-callee pl.submit pattern in the suite to mirror, and the submit return-alias path already exercises the same TryRenderCalleeScalarReturn helper that the passing plain-path test (test_scalar_tuple_return_materialized) covers, so dropping this unverified test restores green CI without losing meaningful coverage. A dedicated submit-path test can be added once the tuple-result submit DSL form is verified on a build.
…ing (hw-native-sys#631) End-to-end system test reproducing the issue hw-native-sys#631 scenario: an InCore producer returns tuple[Tensor, Scalar[INDEX]] where the Scalar (off = base*2) is computed inside the kernel and is not backed by any Out param, and the orchestration forwards that Scalar to a downstream InCore consumer. The consumer uses the forwarded Scalar as its valid column count (valid_shapes=[rows, off]), so the output's valid region depends on the materialized Scalar value: before the fix the Scalar was dropped (undefined variable -> kernel failed to compile); a zero/wrong fallback would zero or shift the valid region. With the fix the Scalar is re-materialized at the call site (int64_t off = (base * 2)) and the result matches the golden. Complements the codegen UT test_scalar_tuple_return_materialized with a runtime (compile + execute) check across all platforms.
…ventions (hw-native-sys#631) Restructure the hw-native-sys#631 system test to follow the sibling ST style (test_dyn_orch_shape.py): the test case constructor now takes (shape, base), the pytest suite parametrizes over (shape, base) x platform, get_name embeds the config, and module-level constants are replaced by a _SHAPE_BASE config list. The DSL program and golden are unchanged (still device-validated).
…m style (hw-native-sys#631) Match the test_set_validshape.py convention: the @pl.program is defined at module level (not nested in get_program), shapes/config are module-level constants (M, N, BASE, VALID_COLS), the golden is a module-level _expected() helper, and the test case only references them with the pytest suite parametrized over platform. DSL program and golden semantics unchanged.
Summary
When an incore kernel returns a tuple that mixes
TensorandScalarelements, the orchestration codegen mapped each return position to anOut/InOutparam. AScalarelement (e.g. an indexo0 = ob * 32computed from an input param) is not backed by anOutparam, so it resolved to no param index and was dropped. The generated C++ then referenced an undefined variable (or, after an earlier stop-gap, a wrongo0 = 0), breaking compilation or producing incorrect runtime values.This is the full fix: the orchestration codegen now materializes
Scalartuple-return elements at the call site (e.g. emittingint64_t o0 = (ob * 32);) instead of dropping them. It traces the callee'sReturnStmtvalue at that position, inlines its local scalar definitions, substitutes each callee param with the expression actually passed for it, and renders via the existingGenerateExprStringpath. When an element cannot be traced to a pure arithmetic function of the call args, it falls back to the prior0default so generated code stays compilable. The fix is applied on both the plain tuple-return path and thepl.submitreturn-alias path.Addresses #631
Test plan
tests/ut/codegen/test_orchestration_codegen.py::test_scalar_tuple_return_materialized, which verifies the generated orchestration C++ materializes the Scalar tuple-return element instead of dropping it.task-submiton device 9. CI does not run device ST, so this is not exercised automatically here.