Add tensor allreduce API and host collective lowering by hashiqiqixian · Pull Request #1750 · hw-native-sys/pypto

hashiqiqixian · 2026-06-11T09:03:18Z

Summary

This PR adds the first-stage infrastructure for tensor-level allreduce in distributed PyPTO.

Add public DSL/IR API pld.tensor.allreduce(src, signal, op=pld.ReduceOp.Sum)
Add ReduceOp support for distributed tensor collectives, with Sum implemented and Max / Min / Prod reserved for future lowerings
Add internal-only builtin.tensor.allreduce op for compiler-generated host collective dispatch
Add LowerHostTensorCollectives pass to lower host-level pld.tensor.allreduce into per-device builtin dispatch calls
Extend comm-domain materialization so user-provided allreduce signal buffers inherit the data buffer's comm-domain coverage
Keep regular non-host pld.tensor.allreduce on the existing composite lowering path
Update English and Chinese distributed op docs and add unit coverage
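
The "Sum implemented, others reserved" rule in the list above can be pictured with a minimal sketch. The `ReduceOp` class here is a hypothetical pure-Python stand-in, not the PyPTO binding; its member values and the `check_reduce_op` helper are assumptions made only for illustration.

```python
from enum import Enum


class ReduceOp(Enum):
    """Hypothetical stand-in for pld.ReduceOp; member values are illustrative."""

    Sum = 0
    Max = 1
    Min = 2
    Prod = 3


# Per the PR summary, only Sum has a lowering at this stage.
_IMPLEMENTED = {ReduceOp.Sum}


def check_reduce_op(op: ReduceOp = ReduceOp.Sum) -> ReduceOp:
    """Return op if a lowering exists, otherwise raise NotImplementedError."""
    if op not in _IMPLEMENTED:
        raise NotImplementedError(
            f"ReduceOp.{op.name} is reserved for a future lowering"
        )
    return op
```

Defaulting `op` to `ReduceOp.Sum` mirrors the signature quoted in the summary, so a caller that omits `op` gets the only currently supported reduction.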

Motivation

This prepares the host-level collective API path without introducing a separate runtime helper or a new user-facing host namespace. Users call the tensor-level API directly:

data = pld.window(data_buf, [N], dtype=pl.FP32)
signal = pld.window(signal_buf, [nranks], dtype=pl.INT32)

data = pld.tensor.allreduce(data, signal, op=pld.ReduceOp.Sum)

For host orchestrators, the compiler keeps the user API as pld.tensor.allreduce and lowers it into internal builtin chip dispatches. The signal tensor remains explicit and user-created, matching the current design direction.
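
For readers new to the collective, the expected result of a sum allreduce can be written as a small pure-Python reference model. This is only a sketch of the mathematical semantics; it does not use PyPTO, windows, or signal buffers, and the function name `allreduce_sum_reference` is hypothetical.

```python
from typing import List


def allreduce_sum_reference(per_rank: List[List[float]]) -> List[List[float]]:
    """Return what each rank holds after a sum allreduce.

    per_rank[r] is rank r's local buffer; all buffers must share one length.
    """
    if not per_rank:
        return []
    n = len(per_rank[0])
    if any(len(buf) != n for buf in per_rank):
        raise ValueError("all ranks must contribute buffers of the same length")
    # Elementwise sum across ranks.
    total = [sum(buf[i] for buf in per_rank) for i in range(n)]
    # Every rank ends up with its own copy of the reduced result.
    return [list(total) for _ in per_rank]
```

With two ranks holding `[1.0, 2.0]` and `[3.0, 4.0]`, both ranks end with `[4.0, 6.0]`.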

Scope

This PR is PR1 of the staged allreduce work. It covers API, IR, comm-domain handling, pass plumbing, and host lowering infrastructure.

It does not yet include the builtin kernel template/codegen materialization path or real device execution for the host builtin. Those are intended for follow-up PRs.

Testing

Added unit tests for pld.tensor.allreduce type deduction and validation
Added parser coverage to reject user-visible pld.builtin.*
Added tests for host allreduce lowering into builtin.tensor.allreduce
Added tests for allreduce signal comm-domain inheritance
Added pass-manager ordering coverage
Updated distributed op documentation in English and Chinese

coderabbitai · 2026-06-11T09:03:32Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


📝 Walkthrough

Walkthrough

This PR adds the pld.tensor.allreduce_ distributed collective operation, enabling in-place tensor reduction over communication domains. The implementation includes a new ReduceOp enum, IR operation registrations (public and internal builtin), a host-level lowering pass that converts allreduce calls into builtin dispatches, comprehensive test coverage, and full documentation in English and Chinese.

Changes

Distributed Tensor All-Reduce Feature

Layer / File(s)	Summary
ReduceOp enum and metadata infrastructure `include/pypto/ir/comm.h`, `include/pypto/ir/op_registry.h`, `python/bindings/modules/ir.cpp`, `python/pypto/pypto_core/ir.pyi`, `python/pypto/language/distributed/__init__.py`	New `ReduceOp` enum with `kSum` variant; new `OpRegistryEntry` APIs for `internal_only` and `template_dir` metadata; Python bindings and stubs expose `ReduceOp` for distributed collectives; DSL re-exports `ReduceOp` as `pld.ReduceOp`.
Allreduce IR operation registration `src/ir/op/distributed/collective.cpp`, `include/pypto/ir/transforms/pass_properties.h`, `include/pypto/ir/transforms/passes.h`, `python/pypto/ir/op/distributed/tensor_ops.py`, `python/pypto/ir/op/distributed/__init__.py`, `tests/ut/ir/test_distributed_ops.py`	Registers public `pld.tensor.allreduce_` and internal-only `builtin.tensor.allreduce` IR ops with type deduction and validation; validates window-bound distributed tensors, enforces INT32 rank-1 signal, restricts to `ReduceOp.Sum` + FP32; adds pass properties; IR builder `allreduce_` and exports.
Host-level allreduce lowering pass `src/ir/transforms/lower_host_tensor_collectives_pass.cpp`, `python/bindings/modules/passes.cpp`, `python/pypto/pypto_core/passes.pyi`, `tests/ut/ir/transforms/test_lower_host_tensor_collectives.py`	Implements `LowerHostTensorCollectives` transformation that rewrites `pld.tensor.allreduce_` calls into `builtin.tensor.allreduce` dispatches within world-size loops or per-device sequences; validates arguments, maps window buffers to comm-domain scope, checks signal capacity; sets `InOut` directions for builtin inputs.
Communication domain scopes enhancement `src/ir/transforms/materialize_comm_domain_scopes_pass.cpp`, `tests/ut/ir/transforms/test_materialize_comm_domain_scopes.py`	Updates `MaterializeCommDomainScopes` to recognize `pld.tensor.allreduce_` and propagate comm-domain coverage from data to signal allocations, ensuring signal buffers are included in scope slots.
DSL wrappers and pipeline integration `python/pypto/language/distributed/op/tensor_ops.py`, `python/pypto/ir/pass_manager.py`, `tests/ut/ir/parser/test_system_ops.py`	Provides `language.distributed.allreduce_` wrapper; validates and lowers to IR builder; integrates `LowerHostTensorCollectives` into `tile_pto_passes` pipeline (both Default and DebugTileOptimization strategies); adds parser test rejecting user-visible `pld.builtin.*` ops.
Documentation and build configuration `docs/en/dev/distributed_ops.md`, `docs/zh-cn/dev/distributed_ops.md`, `CMakeLists.txt`, `tests/ut/ir/transforms/test_pass_manager.py`	English and Chinese documentation describe the new operator (7 ops, 4 ABI enums), signature, semantics, and current FP32/Sum constraints; CMakeLists adds source files; pass manager tests updated for pipeline changes.
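
The host-level lowering row in the table above can be pictured with a toy model. Everything in this sketch is hypothetical: the `HostAllreduce` and `BuiltinDispatch` dataclasses stand in for IR nodes, and the capacity rule (signal `shape[0]` must be at least the participating device count) is an assumption inferred from the test error message quoted later in the review, not taken from the pass itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HostAllreduce:
    """Toy stand-in for a host-level pld.tensor.allreduce call."""

    data: str
    signal: str
    signal_len: int  # signal shape[0]


@dataclass(frozen=True)
class BuiltinDispatch:
    """Toy stand-in for one builtin.tensor.allreduce dispatch on one device."""

    device: int
    data: str
    signal: str


def lower_host_allreduce(call: HostAllreduce, num_devices: int) -> List[BuiltinDispatch]:
    """Rewrite one host-level call into one dispatch per device."""
    # Assumed capacity rule: one signal slot per participating device.
    if call.signal_len < num_devices:
        raise ValueError(
            f"signal shape[0] ({call.signal_len}) is smaller than the "
            f"participating device count ({num_devices})"
        )
    # Emit one dispatch per device, in device order.
    return [BuiltinDispatch(d, call.data, call.signal) for d in range(num_devices)]
```

The real pass also handles world-size loops, comm-domain scope mapping, and argument directions; none of that is modeled here.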

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR introduces a complete feature spanning IR contracts, operation registration, a sophisticated transformation pass with window-buffer back-reference tracking and device-binding logic, interactions with an existing pass, DSL wrappers, and comprehensive tests. The lowering pass (lower_host_tensor_collectives_pass.cpp) contains dense control flow for scope resolution and device binding, requiring careful review of the mutation and validation logic. The materialize_comm_domain_scopes integration adds bookkeeping coordination. The breadth across headers, C++, Python bindings, stubs, and DSL layers increases cognitive load.

Possibly related issues

[Feature] Add orchestration-level collective communication operators to PyPTO compiler #1189: Both PRs add host-level all-reduce support (pld.tensor.allreduce_ and ReduceOp enum) and IR-level lowering for distributed collectives on the same operations and infrastructure.

Poem

🐰 A rabbit hops through windows bound,
With signals dancing all around,
All-reduce sums the scattered dance,
Collective whispers, hosts advance!
Enums and lowering pave the way,
Distributed ops now have their day. 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title 'Add tensor allreduce API and host collective lowering' accurately and concisely summarizes the main changes: introducing the allreduce API and the host-level collective lowering pass.
Description check	✅ Passed	The PR description clearly explains the changes: adding tensor-level allreduce API, ReduceOp enum, builtin ops, host lowering pass, and comm-domain materialization, with detailed motivation and scope.


gemini-code-assist

Code Review

This pull request introduces the pld.tensor.allreduce_ distributed collective operation along with its internal counterpart builtin.tensor.allreduce, exposing them through Python bindings and DSL APIs. It adds the LowerHostTensorCollectives pass to lower host-level all-reduces to builtin chip dispatches, and updates the MaterializeCommDomainScopes pass to propagate comm-domain coverage to signal buffers. Feedback on the changes suggests preserving leading comments during lowering, using the project-standard CHECK macro instead of INTERNAL_CHECK_SPAN for user-facing validation errors, and providing a default value of ReduceOp.Sum for the op parameter in the Python APIs to align with the PR description.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

python/pypto/language/distributed/op/tensor_ops.py (1)

275-275: ⚡ Quick win

Sort __all__ alphabetically for consistency.

The __all__ list should be sorted alphabetically to match the isort-style convention enforced by Ruff RUF022.

📋 Proposed fix

-__all__ = ["allreduce_", "alloc_window_buffer", "get", "put", "window"]
+__all__ = ["alloc_window_buffer", "allreduce_", "get", "put", "window"]
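
The ordering in the proposed fix can be checked with a short sketch. Ruff's RUF022 applies an isort-style sort that also groups names by casing; for all-lowercase names like these, plain `sorted` is assumed to give the same order. The helper name `is_sorted` is hypothetical.

```python
from typing import List


def is_sorted(names: List[str]) -> bool:
    """Return True when names are already in ascending lexicographic order."""
    return names == sorted(names)


before = ["allreduce_", "alloc_window_buffer", "get", "put", "window"]
after = sorted(before)

# "alloc..." sorts before "allre..." because "o" precedes "r"
# at the first differing character.
print(after)
```

Here `is_sorted(before)` is false and `is_sorted(after)` is true, which matches the direction of the diff.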

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/pypto/language/distributed/op/tensor_ops.py` at line 275, the __all__
export list is not alphabetized; update the __all__ variable in tensor_ops.py so
the names are sorted alphabetically (e.g., alloc_window_buffer, allreduce_, get,
put, window) to satisfy the RUF022/isort convention. Locate the __all__
declaration and reorder the string entries into ascending alphabetical order.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/ut/ir/transforms/test_lower_host_tensor_collectives.py`:
- Line 150: The pytest.raises call's match argument currently uses a normal
string with backslashes (match="signal shape\\[0\\].*participating device
count"); change it to a raw string literal (match=r"signal
shape\[0\].*participating device count") so the regex backslashes are
interpreted correctly; update the pytest.raises(...) invocation in
test_lower_host_tensor_collectives.py accordingly.
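
As a side note on the suggestion above, the doubled-backslash string and the raw string spell the same pattern, so the change mainly affects readability. The sketch below shows this with `re.search`, which is what `pytest.raises(match=...)` uses. The error message is made up for illustration and is not the text the pass actually emits.

```python
import re

# Normal string with escaped backslashes, as in the original test.
escaped = "signal shape\\[0\\].*participating device count"
# Raw string, as suggested by the review comment.
raw = r"signal shape\[0\].*participating device count"

# Hypothetical error message, used only to exercise the pattern.
message = "signal shape[0] (2) is smaller than the participating device count (4)"


def matches(pattern: str, text: str) -> bool:
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


print(escaped == raw)
print(matches(raw, message))
```

A non-raw string written with single backslashes (`"\["`) is the form to avoid, since `\[` is not a valid string escape and recent Python versions warn about it.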



📥 Commits

Reviewing files that changed from the base of the PR and between 09c29e0 and b02c2d8.

📒 Files selected for processing (24)

CMakeLists.txt
docs/en/dev/distributed_ops.md
docs/zh-cn/dev/distributed_ops.md
include/pypto/ir/comm.h
include/pypto/ir/op_registry.h
include/pypto/ir/transforms/pass_properties.h
include/pypto/ir/transforms/passes.h
python/bindings/modules/ir.cpp
python/bindings/modules/passes.cpp
python/pypto/ir/op/distributed/__init__.py
python/pypto/ir/op/distributed/tensor_ops.py
python/pypto/ir/pass_manager.py
python/pypto/language/distributed/__init__.py
python/pypto/language/distributed/op/tensor_ops.py
python/pypto/pypto_core/ir.pyi
python/pypto/pypto_core/passes.pyi
src/ir/op/distributed/collective.cpp
src/ir/transforms/lower_host_tensor_collectives_pass.cpp
src/ir/transforms/materialize_comm_domain_scopes_pass.cpp
tests/ut/ir/parser/test_system_ops.py
tests/ut/ir/test_distributed_ops.py
tests/ut/ir/transforms/test_lower_host_tensor_collectives.py
tests/ut/ir/transforms/test_materialize_comm_domain_scopes.py
tests/ut/ir/transforms/test_pass_manager.py

github-project-automation Bot added this to pto project Jun 11, 2026

gemini-code-assist Bot reviewed Jun 11, 2026


coderabbitai Bot reviewed Jun 11, 2026


Comment thread tests/ut/ir/transforms/test_lower_host_tensor_collectives.py Outdated

hashiqiqixian force-pushed the feat/tensor-allreduce-pr1 branch 9 times, most recently from 0d3d22b to 6880307 Compare June 15, 2026 07:29

hashiqiqixian changed the title ~~Add tensor allreduce API and host lowering~~ Add tensor allreduce API and host collective lowering Jun 15, 2026

hashiqiqixian force-pushed the feat/tensor-allreduce-pr1 branch 3 times, most recently from 5909915 to bd38cf4 Compare June 15, 2026 09:17

feat(distributed): Add tensor allreduce IR lowering

1e723e0

hashiqiqixian force-pushed the feat/tensor-allreduce-pr1 branch from bd38cf4 to 1e723e0 Compare June 15, 2026 09:55

hashiqiqixian wants to merge 1 commit into hw-native-sys:main from hashiqiqixian:feat/tensor-allreduce-pr1
