Skip to content

[Performance] Orchestration fanin dedup uses O(N^2) linear scans during TensorMap lookup #1013

@Leaf-Salix

Description

@Leaf-Salix

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Summary

I would be happy to work on a fix for this issue. Please let me know if someone is already working on it so we can avoid duplicate effort.

Orchestration fanin/dependency dedup currently performs a linear scan over already-collected producers for each emitted dependency. In workloads with many tensor args, group submits, overlapping TensorMap hits, or repeated producers, this can become O(N^2) in the number of fanin candidates. This should likely use a set-like membership structure while preserving the existing producer order.

Affected areas found on simpler/main:

  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/common/hierarchical/orchestrator.cpp

Git Commit ID

607f78a

CANN Version

No response

Driver Version

No response

Host Platform

Linux (aarch64)

Reproduction

# Inspect L2 fanin dedup:
grep -n -A 25 "bool contains" \
  src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp

grep -n -A 25 "bool contains" \
  src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp

# Inspect hierarchical orchestration dedup:
grep -n -A 20 "add_unique_producer" \
  src/common/hierarchical/orchestrator.cpp

Expected Performance

Dependency/fanin dedup should be approximately O(N) for N emitted producer candidates, using constant-time membership checks such as an unordered_set or a runtime-friendly fixed/open-addressing set.

Actual Performance

Current code performs a linear membership scan for every candidate producer. Worst-case dedup cost is O(N^2) when many dependencies are emitted during orchestration lookup.

Profiling Data (Optional)

No response

Additional Context

This was first noticed while implementing the feature in
hw-native-sys/pypto#1545. Before adding dummy task
nodes, the dependency pattern could expand from N producers and M
consumers into N*M dependency edges, and orchestration-side fanin lookup /
dedup became a significant cost.

The dummy task approach reduces that dependency shape from N*M to N+M,
which makes the issue much less severe in that specific path, but it does not
remove the underlying O(K^2) dedup behavior. When enough submitted tasks depend
on the same preceding n tasks, the runtime still repeatedly emits candidate
producers and checks each one by linearly scanning the already-collected fanin
list. As the number of dependent task x submissions grows, this can still
become noticeably expensive.

I would be happy to work on a fix for this issue. Please let me know if someone is already working on it so we can avoid duplicate effort.

Metadata

Metadata

Assignees

No one assigned

    Labels

    performancePerformance regression or optimization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions