Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Summary
I would be happy to work on a fix for this issue. Please let me know if someone is already working on it so we can avoid duplicate effort.
Orchestration fanin/dependency dedup currently performs a linear scan over already-collected producers for each emitted dependency. In workloads with many tensor args, group submits, overlapping TensorMap hits, or repeated producers, this can become O(N^2) in the number of fanin candidates. This should likely use a set-like membership structure while preserving the existing producer order.
Affected areas found on simpler/main:
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
src/common/hierarchical/orchestrator.cpp
Git Commit ID
607f78a
CANN Version
No response
Driver Version
No response
Host Platform
Linux (aarch64)
Reproduction
# Inspect L2 fanin dedup:
grep -n -A 25 "bool contains" \
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
grep -n -A 25 "bool contains" \
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
# Inspect hierarchical orchestration dedup:
grep -n -A 20 "add_unique_producer" \
src/common/hierarchical/orchestrator.cpp
Expected Performance
Dependency/fanin dedup should be approximately O(N) for N emitted producer candidates, using constant-time membership checks such as an unordered_set or a runtime-friendly fixed/open-addressing set.
Actual Performance
Current code performs a linear membership scan for every candidate producer. Worst-case dedup cost is O(N^2) when many dependencies are emitted during orchestration lookup.
Profiling Data (Optional)
No response
Additional Context
This was first noticed while implementing the feature in
hw-native-sys/pypto#1545. Before adding dummy task
nodes, the dependency pattern could expand from N producers and M
consumers into N*M dependency edges, and orchestration-side fanin lookup /
dedup became a significant cost.
The dummy task approach reduces that dependency shape from N*M to N+M,
which makes the issue much less severe in that specific path, but it does not
remove the underlying O(K^2) dedup behavior. When enough submitted tasks depend
on the same preceding n tasks, the runtime still repeatedly emits candidate
producers and checks each one by linearly scanning the already-collected fanin
list. As the number of dependent task x submissions grows, this can still
become noticeably expensive.
I would be happy to work on a fix for this issue. Please let me know if someone is already working on it so we can avoid duplicate effort.
Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Summary
Orchestration fanin/dependency dedup currently performs a linear scan over already-collected producers for each emitted dependency. In workloads with many tensor args, group submits, overlapping TensorMap hits, or repeated producers, this can become O(N^2) in the number of fanin candidates. This should likely use a set-like membership structure while preserving the existing producer order.
Affected areas found on
simpler/main:src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cppsrc/common/hierarchical/orchestrator.cppGit Commit ID
607f78a
CANN Version
No response
Driver Version
No response
Host Platform
Linux (aarch64)
Reproduction
Expected Performance
Dependency/fanin dedup should be approximately O(N) for N emitted producer candidates, using constant-time membership checks such as an unordered_set or a runtime-friendly fixed/open-addressing set.
Actual Performance
Current code performs a linear membership scan for every candidate producer. Worst-case dedup cost is O(N^2) when many dependencies are emitted during orchestration lookup.
Profiling Data (Optional)
No response
Additional Context
This was first noticed while implementing the feature in
hw-native-sys/pypto#1545. Before adding dummy task
nodes, the dependency pattern could expand from
Nproducers andMconsumers into
N*Mdependency edges, and orchestration-side fanin lookup /dedup became a significant cost.
The dummy task approach reduces that dependency shape from
N*MtoN+M,which makes the issue much less severe in that specific path, but it does not
remove the underlying O(K^2) dedup behavior. When enough submitted tasks depend
on the same preceding
ntasks, the runtime still repeatedly emits candidateproducers and checks each one by linearly scanning the already-collected fanin
list. As the number of dependent
task xsubmissions grows, this can stillbecome noticeably expensive.