[Performance] Orchestration fanin dedup uses O(N^2) linear scans during TensorMap lookup

### Platform

a2a3 (Ascend 910B/C hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Summary

> I would be happy to work on a fix for this issue. Please let me know if someone is already working on it so we can avoid duplicate effort.

Orchestration fanin/dependency dedup currently performs a linear scan over already-collected producers for each emitted dependency. In workloads with many tensor args, group submits, overlapping TensorMap hits, or repeated producers, this can become O(N^2) in the number of fanin candidates. This should likely use a set-like membership structure while preserving the existing producer order.

Affected areas found on `simpler/main`:

- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`
- `src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`
- `src/common/hierarchical/orchestrator.cpp`

### Git Commit ID

607f78a775eee8f8437a1444ab2dec3e01c62b1c

### CANN Version

_No response_

### Driver Version

_No response_

### Host Platform

Linux (aarch64)

### Reproduction

```bash
# Inspect L2 fanin dedup:
grep -n -A 25 "bool contains" \
  src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp

grep -n -A 25 "bool contains" \
  src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp

# Inspect hierarchical orchestration dedup:
grep -n -A 20 "add_unique_producer" \
  src/common/hierarchical/orchestrator.cpp
```

### Expected Performance

Dependency/fanin dedup should be approximately O(N) for N emitted producer candidates, using constant-time membership checks such as an unordered_set or a runtime-friendly fixed/open-addressing set.

### Actual Performance

Current code performs a linear membership scan for every candidate producer. Worst-case dedup cost is O(N^2) when many dependencies are emitted during orchestration lookup.

### Profiling Data (Optional)

_No response_

### Additional Context

This was first noticed while implementing the feature in
https://github.com/hw-native-sys/pypto/pull/1545. Before adding dummy task
nodes, the dependency pattern could expand from `N` producers and `M`
consumers into `N*M` dependency edges, and orchestration-side fanin lookup /
dedup became a significant cost.

The dummy task approach reduces that dependency shape from `N*M` to `N+M`,
which makes the issue much less severe in that specific path, but it does not
remove the underlying O(K^2) dedup behavior. When enough submitted tasks depend
on the same preceding `n` tasks, the runtime still repeatedly emits candidate
producers and checks each one by linearly scanning the already-collected fanin
list. As the number of dependent `task x` submissions grows, this can still
become noticeably expensive.

> I would be happy to work on a fix for this issue. Please let me know if someone is already working on it so we can avoid duplicate effort.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Performance] Orchestration fanin dedup uses O(N^2) linear scans during TensorMap lookup #1013

Platform

Runtime Variant

Summary

Git Commit ID

CANN Version

Driver Version

Host Platform

Reproduction

Expected Performance

Actual Performance

Profiling Data (Optional)

Additional Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Performance] Orchestration fanin dedup uses O(N^2) linear scans during TensorMap lookup #1013

Description

Platform

Runtime Variant

Summary

Git Commit ID

CANN Version

Driver Version

Host Platform

Reproduction

Expected Performance

Actual Performance

Profiling Data (Optional)

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions