Add tensor allreduce API and host collective lowering #1750

Open
hashiqiqixian wants to merge 1 commit into hw-native-sys:main from hashiqiqixian:feat/tensor-allreduce-pr1

Conversation

@hashiqiqixian hashiqiqixian commented Jun 11, 2026

Contributor

Summary

This PR adds the first-stage infrastructure for tensor-level allreduce in distributed PyPTO.

  • Add public DSL/IR API pld.tensor.allreduce(src, signal, op=pld.ReduceOp.Sum)
  • Add ReduceOp support for distributed tensor collectives, with Sum implemented and Max / Min / Prod reserved for future lowerings
  • Add internal-only builtin.tensor.allreduce op for compiler-generated host collective dispatch
  • Add LowerHostTensorCollectives pass to lower host-level pld.tensor.allreduce into per-device builtin dispatch calls
  • Extend comm-domain materialization so user-provided allreduce signal buffers inherit the data buffer's comm-domain coverage
  • Keep regular non-host pld.tensor.allreduce on the existing composite lowering path
  • Update English and Chinese distributed op docs and add unit coverage
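
The "Sum implemented, others reserved" contract from the list above can be sketched in plain Python. This is only an illustration: the `ReduceOp` class below is a stand-in for `pld.ReduceOp`, and `check_reduce_op`, the member values, and the error type are all hypothetical rather than taken from the PR.

```python
from enum import Enum


class ReduceOp(Enum):
    """Stand-in for pld.ReduceOp; the member values here are arbitrary."""

    Sum = "sum"
    Max = "max"
    Min = "min"
    Prod = "prod"


# Assumption: only Sum has a lowering at this stage, as the summary states.
_IMPLEMENTED = {ReduceOp.Sum}


def check_reduce_op(op=ReduceOp.Sum):
    """Return op if it can be lowered today, otherwise raise (hypothetical helper)."""
    if op not in _IMPLEMENTED:
        raise NotImplementedError(
            f"ReduceOp.{op.name} is reserved for a future lowering"
        )
    return op
```

Defaulting the parameter to `ReduceOp.Sum` mirrors the signature shown in the first bullet, so callers that omit `op` get the only implemented reduction.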

Motivation

This prepares the host-level collective API path without introducing a separate runtime helper or a new user-facing host namespace. Users call the tensor-level API directly:

data = pld.window(data_buf, [N], dtype=pl.FP32)
signal = pld.window(signal_buf, [nranks], dtype=pl.INT32)

data = pld.tensor.allreduce(data, signal, op=pld.ReduceOp.Sum)

For host orchestrators, the compiler keeps the user API as pld.tensor.allreduce and lowers it into internal builtin chip dispatches. The signal tensor remains explicit and user-created, matching the current design direction.
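
As a reading aid, the result a sum allreduce is expected to produce can be modelled on plain Python lists, one list per rank. This is a reference model of the generic collective only, not the PyPTO lowering; `allreduce_sum_reference` is a hypothetical name and the signal tensor is not modelled.

```python
def allreduce_sum_reference(per_rank_data):
    """Return each rank's buffer after a sum allreduce (pure-Python model)."""
    if not per_rank_data:
        return []
    length = len(per_rank_data[0])
    if any(len(buf) != length for buf in per_rank_data):
        raise ValueError("all ranks must contribute buffers of the same length")
    # Element-wise sum across ranks: zip(*...) groups the i-th element of every rank.
    reduced = [sum(column) for column in zip(*per_rank_data)]
    # Every rank ends up holding its own copy of the same reduced buffer.
    return [list(reduced) for _ in per_rank_data]


if __name__ == "__main__":
    print(allreduce_sum_reference([[1.0, 2.0], [3.0, 4.0]]))
```

Each rank receives an independent copy so that mutating one rank's result in a test does not affect the others.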

Scope

This PR is PR1 of the staged allreduce work. It covers API, IR, comm-domain handling, pass plumbing, and host lowering infrastructure.

It does not yet include the builtin kernel template/codegen materialization path or real device execution for the host builtin. Those are intended for follow-up PRs.

Testing

  • Added unit tests for pld.tensor.allreduce type deduction and validation
  • Added parser coverage to reject user-visible pld.builtin.*
  • Added tests for host allreduce lowering into builtin.tensor.allreduce
  • Added tests for allreduce signal comm-domain inheritance
  • Added pass-manager ordering coverage
  • Updated distributed op documentation in English and Chinese
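
The comm-domain inheritance covered by these tests can be pictured with a small model. It assumes, purely for illustration, that coverage is a mapping from buffer name to a set of device ids; the real pass works on IR scopes, and `propagate_signal_coverage` is a hypothetical name.

```python
def propagate_signal_coverage(coverage, allreduce_calls):
    """Return a new coverage map in which each signal buffer also covers
    every device its paired data buffer covers.

    coverage: dict mapping buffer name -> set of device ids (assumed model).
    allreduce_calls: iterable of (data_buffer, signal_buffer) name pairs.
    """
    # Copy the sets so the caller's map is left untouched.
    result = {name: set(devices) for name, devices in coverage.items()}
    for data_buf, signal_buf in allreduce_calls:
        inherited = result.get(data_buf, set())
        result.setdefault(signal_buf, set()).update(inherited)
    return result
```

For example, with `{"data_buf": {0, 1}}` and one call pairing `data_buf` with `signal_buf`, the returned map also contains `signal_buf` with coverage `{0, 1}`.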

coderabbitai Bot commented Jun 11, 2026


Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f4945bd6-a564-4b0c-9917-1ed48f5b2019

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR adds the pld.tensor.allreduce_ distributed collective operation, enabling in-place tensor reduction over communication domains. The implementation includes a new ReduceOp enum, IR operation registrations (public and internal builtin), a host-level lowering pass that converts allreduce calls into builtin dispatches, comprehensive test coverage, and full documentation in English and Chinese.

Changes

Distributed Tensor All-Reduce Feature

| Layer | File(s) | Summary |
|---|---|---|
| ReduceOp enum and metadata infrastructure | include/pypto/ir/comm.h, include/pypto/ir/op_registry.h, python/bindings/modules/ir.cpp, python/pypto/pypto_core/ir.pyi, python/pypto/language/distributed/__init__.py | New ReduceOp enum with kSum variant; new OpRegistryEntry APIs for internal_only and template_dir metadata; Python bindings and stubs expose ReduceOp for distributed collectives; DSL re-exports ReduceOp as pld.ReduceOp. |
| Allreduce IR operation registration | src/ir/op/distributed/collective.cpp, include/pypto/ir/transforms/pass_properties.h, include/pypto/ir/transforms/passes.h, python/pypto/ir/op/distributed/tensor_ops.py, python/pypto/ir/op/distributed/__init__.py, tests/ut/ir/test_distributed_ops.py | Registers public pld.tensor.allreduce_ and internal-only builtin.tensor.allreduce IR ops with type deduction and validation; validates window-bound distributed tensors, enforces INT32 rank-1 signal, restricts to ReduceOp.Sum + FP32; adds pass properties; IR builder allreduce_ and exports. |
| Host-level allreduce lowering pass | src/ir/transforms/lower_host_tensor_collectives_pass.cpp, python/bindings/modules/passes.cpp, python/pypto/pypto_core/passes.pyi, tests/ut/ir/transforms/test_lower_host_tensor_collectives.py | Implements LowerHostTensorCollectives transformation that rewrites pld.tensor.allreduce_ calls into builtin.tensor.allreduce dispatches within world-size loops or per-device sequences; validates arguments, maps window buffers to comm-domain scope, checks signal capacity; sets InOut directions for builtin inputs. |
| Communication domain scopes enhancement | src/ir/transforms/materialize_comm_domain_scopes_pass.cpp, tests/ut/ir/transforms/test_materialize_comm_domain_scopes.py | Updates MaterializeCommDomainScopes to recognize pld.tensor.allreduce_ and propagate comm-domain coverage from data to signal allocations, ensuring signal buffers are included in scope slots. |
| DSL wrappers and pipeline integration | python/pypto/language/distributed/op/tensor_ops.py, python/pypto/ir/pass_manager.py, tests/ut/ir/parser/test_system_ops.py | Provides language.distributed.allreduce_ wrapper; validates and lowers to IR builder; integrates LowerHostTensorCollectives into tile_pto_passes pipeline (both Default and DebugTileOptimization strategies); adds parser test rejecting user-visible pld.builtin.* ops. |
| Documentation and build configuration | docs/en/dev/distributed_ops.md, docs/zh-cn/dev/distributed_ops.md, CMakeLists.txt, tests/ut/ir/transforms/test_pass_manager.py | English and Chinese documentation describe the new operator (7 ops, 4 ABI enums), signature, semantics, and current FP32/Sum constraints; CMakeLists adds source files; pass manager tests updated for pipeline changes. |
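
The rewrite described in the "Host-level allreduce lowering pass" row can be sketched on a toy IR. Everything here is an assumption made for illustration: the `Call` dataclass, the `device` attribute, and the reading of "checks signal capacity" as "at least one signal slot per participating device". The actual pass is C++ and operates on PyPTO IR nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Toy IR call node: an op name plus a tuple of (key, value) attributes."""

    op: str
    attrs: tuple


def lower_host_allreduce(call, device_ids, signal_len):
    """Expand one host-level allreduce into one builtin dispatch per device."""
    if call.op != "pld.tensor.allreduce":
        return [call]  # Every other call is left untouched.
    # Assumed capacity rule: the signal needs one slot per participating device.
    if signal_len < len(device_ids):
        raise ValueError(
            f"signal shape[0] ({signal_len}) is smaller than the "
            f"participating device count ({len(device_ids)})"
        )
    return [
        Call("builtin.tensor.allreduce", call.attrs + (("device", dev),))
        for dev in device_ids
    ]
```

Returning a list keeps the sketch agnostic about whether the real pass emits a world-size loop or an unrolled per-device sequence; both forms are mentioned in the table.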

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR introduces a complete feature spanning IR contracts, operation registration, a sophisticated transformation pass with window-buffer back-reference tracking and device-binding logic, interactions with an existing pass, DSL wrappers, and comprehensive tests. The lowering pass (lower_host_tensor_collectives_pass.cpp) contains dense control flow for scope resolution and device binding, requiring careful review of the mutation and validation logic. The materialize_comm_domain_scopes integration adds bookkeeping coordination. The breadth across headers, C++, Python bindings, stubs, and DSL layers increases cognitive load.

Poem

🐰 A rabbit hops through windows bound,
With signals dancing all around,
All-reduce sums the scattered dance,
Collective whispers, hosts advance!
Enums and lowering pave the way,
Distributed ops now have their day. 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 30.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title 'Add tensor allreduce API and host collective lowering' accurately and concisely summarizes the main changes: introducing the allreduce API and the host-level collective lowering pass. |
| Description check | ✅ Passed | The PR description clearly explains the changes: adding tensor-level allreduce API, ReduceOp enum, builtin ops, host lowering pass, and comm-domain materialization, with detailed motivation and scope. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the pld.tensor.allreduce_ distributed collective operation along with its internal counterpart builtin.tensor.allreduce, exposing them through Python bindings and DSL APIs. It adds the LowerHostTensorCollectives pass to lower host-level all-reduces to builtin chip dispatches, and updates the MaterializeCommDomainScopes pass to propagate comm-domain coverage to signal buffers. Feedback on the changes suggests preserving leading comments during lowering, using the project-standard CHECK macro instead of INTERNAL_CHECK_SPAN for user-facing validation errors, and providing a default value of ReduceOp.Sum for the op parameter in the Python APIs to align with the PR description.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/ir/transforms/lower_host_tensor_collectives_pass.cpp Outdated
Comment thread src/ir/transforms/lower_host_tensor_collectives_pass.cpp Outdated
Comment thread src/ir/transforms/materialize_comm_domain_scopes_pass.cpp Outdated
Comment thread src/ir/transforms/materialize_comm_domain_scopes_pass.cpp Outdated
Comment thread python/pypto/language/distributed/op/tensor_ops.py Outdated
Comment thread python/pypto/ir/op/distributed/tensor_ops.py Outdated

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
python/pypto/language/distributed/op/tensor_ops.py (1)

275-275: ⚡ Quick win

Sort __all__ alphabetically for consistency.

The __all__ list should be sorted alphabetically to match the isort-style convention enforced by Ruff RUF022.

📋 Proposed fix
-__all__ = ["allreduce_", "alloc_window_buffer", "get", "put", "window"]
+__all__ = ["alloc_window_buffer", "allreduce_", "get", "put", "window"]
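
The proposed order can be checked with Python's built-in `sorted`. Ruff's RUF022 applies an isort-style ordering that can differ from plain lexicographic order for mixed-case names, but for an all-lowercase list like this one the two should agree.

```python
# Current order of the export list, as shown on the "-" line of the diff.
exports = ["allreduce_", "alloc_window_buffer", "get", "put", "window"]

# Plain lexicographic sort: "alloc..." precedes "allre..." because "o" < "r".
sorted_exports = sorted(exports)

print(sorted_exports)
# ['alloc_window_buffer', 'allreduce_', 'get', 'put', 'window']
```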
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/pypto/language/distributed/op/tensor_ops.py` at line 275, The __all__
export list is not alphabetized; update the __all__ variable in tensor_ops.py so
the names are sorted alphabetically (e.g., alloc_window_buffer, allreduce_, get,
put, window) to satisfy the RUF022/isort convention. Locate the __all__
declaration and reorder the string entries into ascending alphabetical order.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/ut/ir/transforms/test_lower_host_tensor_collectives.py`:
- Line 150: The pytest.raises call's match argument currently uses a normal
string with escaped backslashes (match="signal shape\\[0\\].*participating device
count"); change it to a raw string literal (match=r"signal
shape\[0\].*participating device count") so the regex escapes are written once
and read as intended. Both spellings yield the same pattern, so this is a
readability fix; update the pytest.raises(...) invocation in
test_lower_host_tensor_collectives.py accordingly.

---

Nitpick comments:
In `@python/pypto/language/distributed/op/tensor_ops.py`:
- Line 275: The __all__ export list is not alphabetized; update the __all__
variable in tensor_ops.py so the names are sorted alphabetically (e.g.,
alloc_window_buffer, allreduce_, get, put, window) to satisfy the RUF022/isort
convention. Locate the __all__ declaration and reorder the string entries into
ascending alphabetical order.
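
On the raw-string suggestion for `pytest.raises(match=...)`: the two spellings produce the same pattern, so the change affects readability rather than behavior. The short check below illustrates this; the sample error message is made up for the demonstration and is not the text the pass emits.

```python
import re

# Normal string literal: each regex backslash must itself be escaped.
escaped = "signal shape\\[0\\].*participating device count"
# Raw string literal: backslashes are written once, exactly as the regex sees them.
raw = r"signal shape\[0\].*participating device count"

# Both literals evaluate to the identical pattern string.
assert escaped == raw

# pytest.raises(match=...) applies re.search to the exception text.
# Hypothetical message, used only to exercise the pattern.
message = "signal shape[0] (2) must cover the participating device count (4)"
assert re.search(raw, message) is not None
```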
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9a872435-9c6e-44f1-8c76-ba9127880066

📥 Commits

Reviewing files that changed from the base of the PR and between 09c29e0 and b02c2d8.

📒 Files selected for processing (24)
  • CMakeLists.txt
  • docs/en/dev/distributed_ops.md
  • docs/zh-cn/dev/distributed_ops.md
  • include/pypto/ir/comm.h
  • include/pypto/ir/op_registry.h
  • include/pypto/ir/transforms/pass_properties.h
  • include/pypto/ir/transforms/passes.h
  • python/bindings/modules/ir.cpp
  • python/bindings/modules/passes.cpp
  • python/pypto/ir/op/distributed/__init__.py
  • python/pypto/ir/op/distributed/tensor_ops.py
  • python/pypto/ir/pass_manager.py
  • python/pypto/language/distributed/__init__.py
  • python/pypto/language/distributed/op/tensor_ops.py
  • python/pypto/pypto_core/ir.pyi
  • python/pypto/pypto_core/passes.pyi
  • src/ir/op/distributed/collective.cpp
  • src/ir/transforms/lower_host_tensor_collectives_pass.cpp
  • src/ir/transforms/materialize_comm_domain_scopes_pass.cpp
  • tests/ut/ir/parser/test_system_ops.py
  • tests/ut/ir/test_distributed_ops.py
  • tests/ut/ir/transforms/test_lower_host_tensor_collectives.py
  • tests/ut/ir/transforms/test_materialize_comm_domain_scopes.py
  • tests/ut/ir/transforms/test_pass_manager.py

Comment thread tests/ut/ir/transforms/test_lower_host_tensor_collectives.py Outdated
@hashiqiqixian hashiqiqixian force-pushed the feat/tensor-allreduce-pr1 branch 9 times, most recently from 0d3d22b to 6880307 Compare June 15, 2026 07:29
@hashiqiqixian hashiqiqixian changed the title Add tensor allreduce API and host lowering Add tensor allreduce API and host collective lowering Jun 15, 2026
@hashiqiqixian hashiqiqixian force-pushed the feat/tensor-allreduce-pr1 branch 3 times, most recently from 5909915 to bd38cf4 Compare June 15, 2026 09:17
@hashiqiqixian hashiqiqixian force-pushed the feat/tensor-allreduce-pr1 branch from bd38cf4 to 1e723e0 Compare June 15, 2026 09:55