feat(backend): add SuperscalarNPU backend with TREG register-file allocation#1680
feat(backend): add SuperscalarNPU backend with TREG register-file allocation#1680lyfne123 wants to merge 2 commits into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces support for a new SuperscalarNPU backend with a register-file memory model. It expands IR memory abstractions, adds a complete backend implementation with TREG (register file) memory, refactors tile memory-space inference and address allocation passes to use backend-specific on-chip defaults and block-indexed addressing, integrates the backend into the dispatch system, exposes it to Python, documents the architecture, and validates the implementation with comprehensive tests. ChangesSuperscalarNPU Backend Addition
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the SuperscalarNPU backend, which features a register-file memory model consisting of DDR and a 1MB TREG register file divided into 256 fixed 4KB blocks. This backend lowers tensor operations to tile operations and performs register-renaming TREG allocation, bypassing Ascend-specific passes. Key changes include updating the memory space enum, extending the memory allocator policy to support address units and maximum units, and integrating the new backend into the shared memory-space inference and address allocation passes. A critical review comment points out that during address allocation for register-file spaces, integer division of the byte address by the address unit could silently truncate non-aligned offsets, potentially causing aliasing bugs. The reviewer suggests adding an INTERNAL_CHECK_SPAN to guarantee that the byte address is perfectly divisible by the address unit.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/ir/transforms/infer_tile_memory_space_pass.cpp (1)
871-886:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTileMemoryInferred verifier still checks raw
Vecconstraints instead of backend-mapped constraints.Phase 1/2/3 now map generic
Vecconstraints to backend on-chip default (TREGon SuperscalarNPU), but verifier input-constraint validation still compares against the un-mapped registry list. This can emit false errors on valid SuperscalarNPU output.Proposed fix
class TileMemoryInferredVerifier : public IRVisitor { public: explicit TileMemoryInferredVerifier(std::vector<Diagnostic>& diagnostics, std::string func_name) - : diagnostics_(diagnostics), func_name_(std::move(func_name)) {} + : diagnostics_(diagnostics), func_name_(std::move(func_name)) { + default_onchip_ = MemorySpace::Vec; + if (backend::BackendConfig::IsConfigured()) { + default_onchip_ = backend::GetBackend()->GetHandler()->GetDefaultOnChipMemorySpace(); + } + } @@ private: std::vector<Diagnostic>& diagnostics_; std::string func_name_; + MemorySpace default_onchip_; @@ - bool allowed = std::find(allowed_spaces.begin(), allowed_spaces.end(), actual) != allowed_spaces.end(); + bool allowed = std::any_of(allowed_spaces.begin(), allowed_spaces.end(), + [&](MemorySpace s) { + return MapConstraintSpace(s, default_onchip_) == actual; + });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/ir/transforms/infer_tile_memory_space_pass.cpp` around lines 871 - 886, VerifyInputConstraints currently compares actual memory_space_ against the raw registry Vec returned by GetInputConstraints, which causes false errors because earlier phases map generic Vec constraints to backend-specific on-chip defaults; update VerifyInputConstraints (and the call-site using GetInputConstraints) to first convert/map the returned allowed_spaces into backend-resolved memory spaces via the same mapping used by TileMemoryInferred (use the TileMemoryInferred backend-mapping helper or the function that maps generic Vec constraints to backend-specific MemorySpace values) and then compare actual against that mapped set (i.e., replace direct use of allowed_spaces with the backend-mapped_allowed_spaces before the std::find check).
🧹 Nitpick comments (1)
tests/ut/ir/transforms/test_superscalar_npu_treg.py (1)
104-107: ⚡ Quick winScope the byte-offset guard to TREG allocation lines
Line 107 currently rejects
pl.const(65536, pl.INT64)anywhere in the whole dump, which can fail on unrelated constants. Narrow this check topl.tile.alloc(pl.Mem.TREG, ...)lines only.Suggested tightening
- assert f"pl.const({_BLOCKS_PER_TILE}, pl.INT64), {_TILE_BYTES})" in text - # A byte-addressed second tile would read pl.const(65536, ...); ensure no - # TREG MemRef uses the byte offset as its address. - assert f"pl.const({_TILE_BYTES}, pl.INT64)" not in text + treg_alloc_lines = [ln for ln in text.splitlines() if "pl.tile.alloc(pl.Mem.TREG" in ln] + assert any( + f"pl.const({_BLOCKS_PER_TILE}, pl.INT64), {_TILE_BYTES})" in ln for ln in treg_alloc_lines + ) + # A byte-addressed second tile would read pl.const(65536, ...); ensure this + # does not appear as a TREG MemRef address. + assert all(f"pl.const({_TILE_BYTES}, pl.INT64)" not in ln for ln in treg_alloc_lines)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/ut/ir/transforms/test_superscalar_npu_treg.py` around lines 104 - 107, The test currently asserts that f"pl.const({_TILE_BYTES}, pl.INT64)" does not appear anywhere in the dump; instead restrict this check to TREG allocation lines by scanning only the lines (or substrings) that contain "pl.tile.alloc(pl.Mem.TREG" and asserting that none of those lines include the f"pl.const({_TILE_BYTES}, pl.INT64)" pattern; update the assertion to use a targeted search/regex against "pl.tile.alloc(pl.Mem.TREG" entries rather than the whole `text`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@include/pypto/backend/common/soc.h`:
- Around line 265-267: Update the contradictory SoC doc wording so it
consistently states that the SuperscalarNPU model has no dedicated CUBE core or
Ascend L0/L1 hierarchy and instead uses a single compute core modeled as VECTOR
carrying the TREG memory; specifically, edit the description mentioning "no
cube/vector cores" and the sentence referencing CoreType and VECTOR so both
assert "no dedicated CUBE core" (or "no Ascend L0/L1 buffers") and that the
single compute core is modeled as VECTOR in CoreType.
In `@src/backend/common/backend_registry.cpp`:
- Around line 72-79: Update the thrown ValueError in
RegisterBackendSuperscalarNPU so the guidance includes
SuperscalarNPU::Instance() (i.e., change the message in the throw within
RegisterBackendSuperscalarNPU to mention "use
BackendSuperscalarNPU::Instance()") and similarly update any other registry
error messages that currently reference Backend910B::Instance() or
Backend950::Instance() so they mention the new SuperscalarNPU::Instance() option
(look for the throw ValueError calls in the backend registration functions and
update their strings).
In `@src/ir/transforms/allocate_memory_addr_pass.cpp`:
- Around line 372-374: The division by addr_unit in the stored_addr computation
can collapse distinct byte addresses when byte_addr is not aligned; update the
code around byte_addr/stored_addr (using symbols current_addr, relative_offset,
byte_addr, stored_addr, addr_unit, and ConstInt) to validate alignment before
converting: if addr_unit > 1 assert or check that (byte_addr % addr_unit) == 0
and only then compute stored_addr = byte_addr / addr_unit; otherwise handle the
unaligned case explicitly (e.g., emit an error/guard, convert to a
byte-granularity address path, or use a defined rounding/packing strategy) so
you never silently floor non-aligned byte addresses into the same stored index.
Ensure the chosen behavior is deterministic and documented in the surrounding
logic.
---
Outside diff comments:
In `@src/ir/transforms/infer_tile_memory_space_pass.cpp`:
- Around line 871-886: VerifyInputConstraints currently compares actual
memory_space_ against the raw registry Vec returned by GetInputConstraints,
which causes false errors because earlier phases map generic Vec constraints to
backend-specific on-chip defaults; update VerifyInputConstraints (and the
call-site using GetInputConstraints) to first convert/map the returned
allowed_spaces into backend-resolved memory spaces via the same mapping used by
TileMemoryInferred (use the TileMemoryInferred backend-mapping helper or the
function that maps generic Vec constraints to backend-specific MemorySpace
values) and then compare actual against that mapped set (i.e., replace direct
use of allowed_spaces with the backend-mapped_allowed_spaces before the
std::find check).
---
Nitpick comments:
In `@tests/ut/ir/transforms/test_superscalar_npu_treg.py`:
- Around line 104-107: The test currently asserts that f"pl.const({_TILE_BYTES},
pl.INT64)" does not appear anywhere in the dump; instead restrict this check to
TREG allocation lines by scanning only the lines (or substrings) that contain
"pl.tile.alloc(pl.Mem.TREG" and asserting that none of those lines include the
f"pl.const({_TILE_BYTES}, pl.INT64)" pattern; update the assertion to use a
targeted search/regex against "pl.tile.alloc(pl.Mem.TREG" entries rather than
the whole `text`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ccd9bed6-68b5-4551-81f4-c8e23186de22
📒 Files selected for processing (29)
CMakeLists.txtdocs/en/dev/backend/00-backend_handler.mddocs/en/dev/ir/01-hierarchy.mddocs/zh-cn/dev/backend/00-backend_handler.mddocs/zh-cn/dev/ir/01-hierarchy.mdinclude/pypto/backend/common/backend.hinclude/pypto/backend/common/backend_handler.hinclude/pypto/backend/common/soc.hinclude/pypto/backend/superscalar/backend_superscalar_npu.hinclude/pypto/backend/superscalar/backend_superscalar_npu_handler.hinclude/pypto/backend/superscalar/register_allocator_policy.hinclude/pypto/ir/memory_allocator_policy.hinclude/pypto/ir/memory_space.hpython/bindings/modules/backend.cpppython/bindings/modules/ir.cpppython/pypto/backend/__init__.pypython/pypto/ir/pass_manager.pypython/pypto/pypto_core/backend.pyipython/pypto/pypto_core/ir.pyisrc/backend/common/backend.cppsrc/backend/common/backend_registry.cppsrc/backend/common/soc.cppsrc/backend/superscalar/backend_superscalar_npu.cppsrc/backend/superscalar/backend_superscalar_npu_handler.cppsrc/ir/memref.cppsrc/ir/transforms/allocate_memory_addr_pass.cppsrc/ir/transforms/infer_tile_memory_space_pass.cpptests/ut/backend/test_backend_superscalar_npu.pytests/ut/ir/transforms/test_superscalar_npu_treg.py
- AllocateMemoryAddr: guard register-file block-index conversion with an
INTERNAL_CHECK_SPAN that the byte address is unit-aligned before dividing,
so a non-aligned offset can no longer silently truncate and alias distinct
addresses to the same block index (CodeRabbit, gemini-code-assist).
- soc.h: fix contradictory SuperscalarNPU SoC doc wording ("no cube/vector
cores" vs "modelled as VECTOR") — now "no dedicated CUBE core / no Ascend
L1/L0 hierarchy; single compute core modelled as VECTOR" (CodeRabbit).
- backend_registry.cpp: include BackendSuperscalarNPU::Instance() in the
registry singleton-guidance error messages (CodeRabbit).
- tests: order imports as a single group matching the repo convention
(pytest < pypto < pypto.*), fixing the pre-commit isort failure.
…ocation Introduce a new SuperscalarNPU backend whose only memory spaces are DDR and TREG, a register file of 256 fixed 4KB blocks (1MB) addressed by block index. The backend lowers tensor ops to tile ops and allocates TREG via register renaming; code generation is not implemented yet (IR-only). Memory space: - Add MemorySpace::TREG (enum, string maps, Python binding, type stub, BNF docs). Register-renaming allocation: - Extend MemoryAllocatorPolicy with AddressUnitBytes (default 1) and MaxAddressUnits (default unbounded). AllocateMemoryAddr now stores block indices for register-file spaces (unit > 1), reserves consecutive blocks for tiles larger than one block, and raises a user-facing error when more than the available blocks are simultaneously live (register pressure). The AllocatedMemoryAddr verifier scales stored addresses by the unit so the byte-based capacity check is unit-agnostic. - Add SuperscalarNPURegisterAllocatorPolicy (4KB blocks, 256-block limit). Liveness-based coalescing is still performed by the shared MemoryReuse pass. Backend registration: - Add BackendType::SuperscalarNPU, CreateSuperscalarNPUSoC (DDR + 1MB TREG), BackendSuperscalarNPU + SuperscalarNPUHandler, factory/registry wiring, CMake sources, Python bindings, and type stubs. Pass pipeline: - Add OptimizationStrategy.SuperscalarNPU: tensor prefix + outline + ConvertTensorToTileOps + InferTileMemorySpace + InitMemRef + MemoryReuse + AllocateMemoryAddr, omitting all Ascend cube/vector/cross-core passes. - Add BackendHandler::GetDefaultOnChipMemorySpace (Vec on Ascend, TREG here) so InferTileMemorySpace targets each backend's on-chip storage without branching on BackendType; a generic Vec op input-constraint is realized as this default so no spurious tile.move is inserted. Ascend behavior is unchanged. Tests: - Backend construction/SoC/handler tests and an end-to-end TREG strategy test (TREG assignment, block-index addressing, register renaming, pressure error). Docs (en + zh-cn): BackendHandler reference and IR hierarchy BNF.
- AllocateMemoryAddr: guard register-file block-index conversion with an
INTERNAL_CHECK_SPAN that the byte address is unit-aligned before dividing,
so a non-aligned offset can no longer silently truncate and alias distinct
addresses to the same block index (CodeRabbit, gemini-code-assist).
- soc.h: fix contradictory SuperscalarNPU SoC doc wording ("no cube/vector
cores" vs "modelled as VECTOR") — now "no dedicated CUBE core / no Ascend
L1/L0 hierarchy; single compute core modelled as VECTOR" (CodeRabbit).
- backend_registry.cpp: include BackendSuperscalarNPU::Instance() in the
registry singleton-guidance error messages (CodeRabbit).
- tests: order imports as a single group matching the repo convention
(pytest < pypto < pypto.*), fixing the pre-commit isort failure.
f72039b to
1f40641
Compare
Summary
Adds a new SuperscalarNPU backend whose only memory spaces are DDR and TREG — a register file of 256 fixed 4 KB blocks (1 MB), addressed by block index. The backend lowers tensor ops to tile ops and allocates TREG via register renaming. Code generation is not implemented yet; this is an IR-only backend.
Memory space
MemorySpace::TREGacross the enum, string maps, Python binding, type stub, and IR-hierarchy BNF.Register-renaming allocation
MemoryAllocatorPolicywithAddressUnitBytes(default 1) andMaxAddressUnits(default unbounded).AllocateMemoryAddrstores block indices for register-file spaces (unit > 1), reserves consecutive blocks for tiles larger than one block, and raises a user-facing error when more blocks are simultaneously live than exist (register pressure). TheAllocatedMemoryAddrverifier scales stored addresses by the unit so the byte-based capacity check stays unit-agnostic.SuperscalarNPURegisterAllocatorPolicy(4 KB blocks, 256-block limit). Liveness-based coalescing is still done by the sharedMemoryReusepass.Backend registration
BackendType::SuperscalarNPU,CreateSuperscalarNPUSoC(DDR + 1 MB TREG),BackendSuperscalarNPU+SuperscalarNPUHandler, factory/registry wiring, CMake sources, Python bindings, and type stubs.Pass pipeline
OptimizationStrategy.SuperscalarNPU: tensor prefix + outline +ConvertTensorToTileOps+InferTileMemorySpace+InitMemRef+MemoryReuse+AllocateMemoryAddr, omitting all Ascend cube/vector/cross-core passes.BackendHandler::GetDefaultOnChipMemorySpace(Vec on Ascend, TREG here) soInferTileMemorySpacetargets each backend's on-chip storage without branching onBackendType; a genericVecop input-constraint is realized as this default so no spurioustile.moveis inserted. Ascend behavior is unchanged (its default is Vec, so all new code paths are no-ops there).Testing
tests/ut/ir/transforms/(1550 passed),tests/ut/codegen/(408 passed), allocate/reuse/infer suites — Ascend unaffected.Notes
CoreTypeonly offersCUBE/VECTOR; the single SuperscalarNPU compute core reusesVECTOR(noted in the SoC factory).