# [Bug] AICore VEC UB-not-aligned (507018) on step3p5 single-rank, while qwen3/32b passes on the same chip

**Repository**: `hw-native-sys/simpler` (primary) — likely also touches
`hw-native-sys/pypto` or `hw-native-sys/ptoas` codegen; refile to the right
component on triage.

**Filed as**: [simpler#1036](https://github.com/hw-native-sys/simpler/issues/1036) (2026-06-12)

**Severity**: blocking — Phase 15 single-rank NPU bring-up cannot proceed,
and any TP=N follow-up will hit the same fault per-rank.

> **2026-06-15 update — kernel pinned to `full_head_gate` (task 11), not
> `full_rope_kv_cache`.** A subsequent `P15_DISPATCH_LIMIT` bisect that
> comments out late `rt_submit_*_task(K, …)` calls in the generated
> `chip_orch.cpp` before the .so build flipped the result at exactly
> `LIMIT=11`:
>
> | `P15_DISPATCH_LIMIT` | dispatched tasks | result |
> |---|---|---|
> | 6  | 0..6 (rmsnorm → rope) | **PASS** |
> | 10 | 0..10 (incl. fa_fused 4-spmd block, no head_gate) | **PASS** |
> | **11** | + `full_head_gate` | **FAIL 507018** |
> | 12, 14, 22 | progressively more | FAIL (same signature) |
>
> The `tslot:6` field in plog is an FFTS+ internal slot, NOT the
> `chip_orch` task index — the apparent match to "task 6 =
> `full_rope_kv_cache`" was a misread that this issue's first version
> propagated. The actual culprit is dispatched at chip_orch task 11 (AIV)
> = the per-rank head-wise sigmoid gate. This also matches the local
> `TASK-30 full_head_gate AIV0 stall` entry in our backlog (CLAUDE.md);
> 507018 single-rank and TASK-30 are the same bug.
>
> **Reproducer pinned more precisely**: `pypto-lib/tools/p15_trace/run_with_trace.py`
> with `P15_DISPATCH_LIMIT=10` PASSes, `P15_DISPATCH_LIMIT=11` FAILs. The
> trace harness sits at the `compile_single_orchestration` chokepoint;
> no simpler-runtime rebuild required.
>
> **What this changes for the maintainer ask below**: the disassembly
> request is now scoped to the dispatched function for `full_head_gate`
> (AIV, source `pypto-lib/models/step3p5/attention_full.py:564-586` —
> outer `pl.spmd(BATCH // BATCH_TILE) + pl.range(NUM_HEADS_FULL_LOCAL)`
> head-wise sigmoid then assemble into `attn_out_gated`), not the rope
> body. The rest of this issue's diagnosis (executor binary identification,
> qwen3 counter-example, version pin matrix, what we tried) remains valid.

> **Supersedes the operating hypothesis in `pypto-1702-followup.md`** (filed
> 2026-06-10 as pypto#1738 = "PR#1718 doesn't fix 507018 — SSA aliasing in
> another path"). After this session's eight Phase A model-side mitigations
> all failed to shift the fault hash, the SSA-aliasing theory is now ruled
> out. The crashing kernel binary is shown below to be **simpler runtime's
> own AICore polling-dispatch executor** (binSize 140920 = exact match for
> `simpler/build/lib/a2a3/onboard/tensormap_and_ringbuffer/aicore_kernel.o`),
> so the fix path moves out of pypto codegen and into simpler/PTOAS.

## Summary

Running step3p5's `step3p5_decode -p a2a3 -d 0 --no-smoke --dummy-weights`
crashes the chip ~22 ms after the first kernel dispatch with:

- Host: `aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018`
- Plog: `errcode 0x800 errorStr: The UB address accessed by the VEC
  instruction is not aligned. ... subErrType:4, tslot:6`
- `fault kernel_name=aicore_kernel_0_mix_aic, hash=15033215677169261682,
  binSize=140920`

That `binSize=140920` matches **exactly** the simpler runtime's compiled
`build/lib/a2a3/onboard/tensormap_and_ringbuffer/aicore_kernel.o` (140920
bytes). The "fault kernel" is the simpler-runtime polling-dispatch executor
itself; the genuinely faulting VEC instruction lives in a function dispatched
via `payload->function_bin_addr` from `execute_task` (FFTS+ MIX stream,
`tslot:6`). CANN's `PrintErrorInfoForDavinciTask` reports the entry-point
binary, not the dispatched function, so the *executor* hash is constant
across all model-side mitigations we tried.

A counter-example in the same repo passes cleanly on the same chip / CANN /
simpler / pto-isa / PTOAS: see "Counter-example" below.

## Reproducer

Single command from a `--no-smoke --dummy-weights` invocation against the
current `pypto-lib/models/step3p5/` working tree:

```bash
# venv with pypto + simpler + pto-isa + ptoas installed
cd /path/to/pypto-lib
python -m models.step3p5.step3p5_decode \
    -p a2a3 -d 0 --no-smoke --dummy-weights
```

Observed (truncated):

```
[chip_process pid=… dev=0] ready
[ERROR] sync_run_streams: [device_runner_base.cpp:877]
        aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
[ERROR] recover_device_or_mark_unusable: [device_runner.cpp:456]
        Device unrecoverable after AICore error 507018:
        aclrtSynchronizeDeviceWithTimeout failed: 507015.
RuntimeError: WorkerThread::dispatch_process: child failed (code=1):
              chip_process dev=0: RuntimeError:
              run_prepared failed with code 507018
```

Plog (from `ASCEND_PROCESS_LOG_PATH=$LOGDIR` then `find $LOGDIR -name
'plog-*.log'`):

```
AllKernelRegister: Runtime_alloc_size 1240, type=0,
    kernel_name=aicore_kernel_0_mix_aic, tilingkey=0, offset=144,
    length=232, dfxAddr=0x0, dfxSize=0, kernelVfType=0, shareMemSize=0.
LaunchKernelWithHandle: kernel info : device_id=0, stream_id=45, task_id=0,
    kernelType=0, kernel_name=aicore_kernel_0_mix_aic,
    arg_size=8, mixType=3, taskRation=2, funcType=0,
    addr1=0x124000000090, addr2=0x12400000076c, flag=0, kernelFlag=0x0,
    qos=0, partId=0, schemMode=1, infoAddr=(nil), atomicIndex=0.
FillFftsMixSqeForDavinciTask: kernelNames_=aicore_kernel_0_mix_aic,
    stackSize=32768.
PrintCoreInfo: The extend info: errcode:(0, 0x800, 0)
    errorStr: The UB address accessed by the VEC instruction is not aligned.
    fixp_error0 info: 0xcc175f7, fixp_error1 info: 0x8e,
    fsmId:0, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.
PrintErrorInfoForDavinciTask: Aicore kernel execute failed, device_id=0,
    stream_id=45, report_stream_id=45, task_id=0, flip_num=0,
    fault kernel_name=aicore_kernel_0_mix_aic, fault kernel info ext=none,
    program id=0, hash=15033215677169261682.
GetBinAndKernelNameExceptionArgs: kernel_name=aicore_kernel_0_mix_aic,
    kernelNameSize=23, binSize=140920.
```

A subsequent `PrintAicpuErrorInfo: funcName=simpler_aicpu_exec,
errorCode=0x2a` appears ~1900 ms later; this is a **cascade** from the
chip's fault state, not a primary cause (verified by timestamp ordering).
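Because the "same hash" comparisons later in this report rest on a handful of plog fields, it helps to extract them mechanically. The following is a minimal sketch, assuming the field layout seen in the excerpt above; the regular expressions and the `fault_signature` helper are our own illustration, not a documented CANN log format or tool.

```python
import re

# Patterns are inferred from the plog excerpt in this report (assumption),
# not from any official description of the plog format.
_PATTERNS = {
    "errcode": re.compile(r"errcode:\(\s*\d+,\s*(0x[0-9a-fA-F]+),"),
    "tslot": re.compile(r"tslot:(\d+)"),
    "sub_err_type": re.compile(r"subErrType:(\d+)"),
    "kernel_name": re.compile(r"fault kernel_name=([A-Za-z0-9_]+)"),
    "hash": re.compile(r"hash=(\d+)"),
    "bin_size": re.compile(r"binSize=(\d+)"),
}


def fault_signature(plog_text: str) -> dict:
    """Return the first match of each signature field found in plog_text."""
    signature = {}
    for name, pattern in _PATTERNS.items():
        match = pattern.search(plog_text)
        if match:
            signature[name] = match.group(1)
    return signature


SAMPLE = """\
PrintCoreInfo: The extend info: errcode:(0, 0x800, 0)
    errorStr: The UB address accessed by the VEC instruction is not aligned.
    fixp_error0 info: 0xcc175f7, fixp_error1 info: 0x8e,
    fsmId:0, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.
PrintErrorInfoForDavinciTask: Aicore kernel execute failed, device_id=0,
    stream_id=45, report_stream_id=45, task_id=0, flip_num=0,
    fault kernel_name=aicore_kernel_0_mix_aic, fault kernel info ext=none,
    program id=0, hash=15033215677169261682.
GetBinAndKernelNameExceptionArgs: kernel_name=aicore_kernel_0_mix_aic,
    kernelNameSize=23, binSize=140920.
"""

if __name__ == "__main__":
    for key, value in fault_signature(SAMPLE).items():
        print(f"{key}: {value}")
```

Two runs can then be compared by checking that their signature dictionaries are equal. Values are kept as strings so that the 64-bit hash is compared exactly as printed.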

## Counter-example

The qwen3-32b decode reference in the same `pypto-lib`, with the same
single-card harness, passes end-to-end in ~20 seconds:

```bash
cd /path/to/pypto-lib
python -m models.qwen3.32b.qwen3_32b_decode -p a2a3 -d 0
# [RUN] PASS (20.15s)
# [RUN]   'out' PASS  shape=(16, 8192) dtype=torch.bfloat16
```

Same chip / CANN / simpler / pto-isa / PTOAS / Python. The chip + runtime
stack itself is healthy. step3p5 specifically triggers the fault.

## What is at tslot:6 in step3p5

Reading the compiled `next_levels/chip_orch/orchestration/chip_orch.cpp`,
task dispatch index 6 is `full_rope_kv_cache` — a per-batch AIV scope that
applies partial RoPE (rotary_dim=64 of head_dim=128, pass-through tail=64)
and writes K/V cache + a padded Q block. The relevant Python source is
`pypto-lib/models/step3p5/attention_full.py:387-484` (scope 2).

## Mapping addr to the fault PC

`addr1=0x124000000090` and `addr2=0x12400000076c` are the chip-side virtual
addresses of the executor entry. Plog `PrintCoreInfo` reports
`pc current: 0x12c0c00d9d9c` (a line not included in the truncated excerpt
above), which is 0x80c00d9d0c bytes (about 515 GiB) past `addr1`. The
executor binary is only 140920 bytes (= 0x22678), so the PC is **not** in
the executor: it is in a dispatched function loaded elsewhere in chip
address space (`function_bin_addr` from `PTO2DispatchPayload`). We were
unable to disassemble device-side binaries from inside the container (no
gdb / objdump for AICore), so we cannot localise the misaligned VEC at the
instruction level. **Asking the maintainer team to do this localisation is
the primary purpose of this report.**
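The address arithmetic above can be reproduced with a few lines of Python. This is a sketch using only the values quoted from the plog; the `could_be_inside` helper and its "within one image size of the entry address" test are our own conservative assumption, since we do not know the exact load base of the executor image.

```python
# Values copied from the plog lines quoted in this report.
ADDR1 = 0x124000000090       # LaunchKernelWithHandle addr1
ADDR2 = 0x12400000076C       # LaunchKernelWithHandle addr2
PC_CURRENT = 0x12C0C00D9D9C  # PrintCoreInfo "pc current"
BIN_SIZE = 140920            # GetBinAndKernelNameExceptionArgs binSize


def could_be_inside(pc: int, anchor: int, size: int) -> bool:
    """True if pc is within `size` bytes of `anchor` in either direction.

    Assumption: the executor image is loaded contiguously and `anchor`
    points somewhere inside it, so every address in the image is closer
    to `anchor` than `size` bytes.
    """
    return abs(pc - anchor) < size


if __name__ == "__main__":
    print(hex(PC_CURRENT - ADDR1))                        # 0x80c00d9d0c
    print(hex(BIN_SIZE))                                  # 0x22678
    print(could_be_inside(PC_CURRENT, ADDR1, BIN_SIZE))   # False
    print(could_be_inside(ADDR2, ADDR1, BIN_SIZE))        # True
```

`addr2` lands inside the window while `pc current` does not, which is consistent with the PC belonging to a separately loaded dispatched function rather than to the executor image.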

## Version pin table

| Component | HEAD | Notes |
|---|---|---|
| Chip | Ascend 910B2C (`Short_SoC_version=Ascend910B`) | `dav-c220-cube` / `dav-c220-vec`, 24 AIC / 48 AIV / 6 AICPU per die |
| Driver | `npu-smi 25.5.1` | |
| CANN | `9.0.0-beta.1` | Also reproduced on `8.5.1` (verified live `libhccl.so` load) |
| Python | 3.11.14 | Reproduced on 3.10 too |
| simpler | branch `fix/tensor-zero-size-view-bounds:0cd317e7` (= PR #1023 plus host-side `--no-as-needed` link patch + `comm_hccl.cpp` P2P best-effort) | Also reproduces on `main:afb5c5a9` |
| pto-isa | `main:109c9f72` | |
| PTOAS | binary `v0.44` (source `main:29a8af28`) | Also reproduces on `v0.43` |
| pypto | `main:0f4881cb` (post PR#1718 merge) | |
| pypto-lib | `main:9c5593fb` + step3p5 working-tree WIP | qwen3/32b passes against the same SHA |

The constancy of `hash=15033215677169261682, binSize=140920,
addr1=0x124000000090, addr2=0x12400000076c, tslot:6` across all eight
model-side mitigations below (which change every named step3p5 kernel) is
decisive evidence that the faulting kernel is **not** in step3p5 Python
source — it is in code reachable through simpler's polling-dispatch
executor.

## What we tried (all leave the failure unchanged)

1. **Phase A split-scope refactor of `full_fa_fused`** into 4 sequential
   spmds (`full_qk_matmul` AIC → `full_softmax` AIV → `full_sv_matmul` AIC →
   `full_online_softmax` AIV), mirroring qwen3/32b's pattern. Eliminates the
   mixed AIC+AIV dispatch entirely. `chip_orch.cpp` verified — no remaining
   `MixedKernels` groups. → same hash, same tslot:6.
2. **Split `full_out_proj`** into pure-cube matmul + pure-vec cast via FP32
   GM scratch handoff. → same hash.
3. **Split `dense_gate_up_silu_tp` and `dense_down_proj_tp`** the same way.
   → same hash.
4. **SWA mirror** of (1)+(2) in `attention_swa.py`. → same hash.
5. **Full-row cast + overwrite RoPE idiom** for `full_rope_kv_cache` K and Q
   writes (qwen3/32b uses this; replaces a `pl.add(k_pass, 0.0)` workaround
   that was previously masking a different compile-time error). → same hash.
6. **Rewrite `full_rmsnorm_zc`** from `pl.spmd(BATCH//BATCH_TILE=1) +
   pl.range` to qwen3/32b's `pl.at(level=pl.Level.CORE_GROUP) +
   pl.pipeline(stage=4)` form. → same hash.
7. **`pl.parallel(user_batch)` → `pl.parallel(BATCH)`** (dynamic loop bound
   replaced with static Python constant). Tested with defensive `b_safe =
   pl.min(b, user_batch-1)` clamp. → same hash.
8. **TP=1 monkey-patch path** (`--tp-world-size 1`) which takes a different
   code path in `step3p5_decode.py:351-381`. → same hash.

All eight changes are compile-clean (smoke probe rc=0) and structurally
match the working qwen3/32b form in the same repo. None move the fault.

## Rule-out matrix

| Suspect | Outcome | Evidence |
|---|---|---|
| Mixed AIC+AIV dispatch | Ruled out | All `MixedKernels` removed via (1)–(4); `chip_orch.cpp` has only `rt_submit_aic_task` / `rt_submit_aiv_task` |
| pypto#1693 / PR#1718 (multi-output spmd SSA aliasing) | Ruled out | PR #1718 merged on pod (`pypto:0f4881cb`); no effect on this fault |
| CANN version | Ruled out | Reproduces on 8.5.1 and 9.0.0-beta.1 |
| Python version | Ruled out | Reproduces on 3.10 and 3.11 |
| PTOAS version | Ruled out | Reproduces on v0.43 and v0.44 |
| SDMA workspace (`aclnnShmemSdmaStarsQuery`) AICPU 0x2a | Ruled out | `SIMPLER_ENABLE_PTO_SDMA_WORKSPACE=OFF` already in effect; `nm -D libhost_runtime.so` returns zero `SdmaWorkspaceManager` symbols; AICPU 0x2a in the log is a cascade ~1900 ms after the AICore 0x800 |
| TP=8 canonical vs TP=1 monkey-patch | Ruled out | Same hash on both code paths |
| Driver `support_shmem_map_exbus=0` | Ruled out for single-rank | This driver flag affects multi-rank `aclrtIpcMemImportByKey` (filed separately); single-rank reproducer here uses no cross-card IPC |
| Dummy input out-of-bounds (`pos = ctx_len-1` underflow when `seq_lens=0`) | Ruled out | Current `step3p5_decode.py:552` sets `seq_lens = torch.ones(...)` so `pos = 0` is always in-bounds; `slot_mapping = torch.arange(B)` gives unique slots |
| `pl.parallel(dynamic)` vs `pl.parallel(static)` loop bound | Ruled out | (7) above |

## What we believe but cannot verify locally

The faulting VEC instruction is inside a model-kernel dispatched by
simpler's executor at the 7th FFTS+ MIX SQE slot. Strong candidate based on
the dispatch sequence we read out of `chip_orch.cpp`:

- `full_rope_kv_cache` (per-batch loop, AIV) — its position in dispatch
  order matches `tslot:6` directly.

The TileType::Vec declarations we read out of the generated
`full_rope_kv_cache.cpp` all use 32-byte-aligned widths (`float[1,32]`,
`bfloat16[1,32]`, `float[8,32]`, etc.). We did not find a structural
alignment violation by inspection. The pattern that differs from qwen3/32b
and is unique to step3p5 is **partial RoPE** (`ROTARY_HALF_FULL=32`,
rotary_dim=64, pass-through=64) versus full RoPE (rotary_dim=128). Whether
PTOAS lowers the partial-RoPE pattern into an unaligned VEC is what we'd
like the codegen team to confirm.
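The by-inspection check described above amounts to the following arithmetic. This is a sketch only: the tile list and the partial-RoPE constants come from this report, while the column-offset layout (half-split rotary halves followed by the pass-through tail) and the helper names are assumptions made for illustration.

```python
ALIGN_BYTES = 32  # alignment the tile widths were checked against
DTYPE_BYTES = {"float": 4, "bfloat16": 2}

# Tile declarations quoted in the report: (dtype, shape).
TILES = [("float", (1, 32)), ("bfloat16", (1, 32)), ("float", (8, 32))]

# Partial-RoPE constants from the report.
HEAD_DIM = 128
ROTARY_DIM = 64
ROTARY_HALF = 32

# Hypothetical column offsets (in elements) within one head, assuming a
# half-split rotary layout followed by the pass-through tail.
COLUMN_OFFSETS = {"rot_lo": 0, "rot_hi": ROTARY_HALF, "pass_through": ROTARY_DIM}


def row_bytes(dtype: str, shape: tuple) -> int:
    """Bytes in one row (innermost dimension) of a tile."""
    return shape[-1] * DTYPE_BYTES[dtype]


def is_aligned(nbytes: int, align: int = ALIGN_BYTES) -> bool:
    return nbytes % align == 0


def misaligned_items() -> list:
    """Describe every tile row width or column offset that is not aligned."""
    bad = []
    for dtype, shape in TILES:
        if not is_aligned(row_bytes(dtype, shape)):
            bad.append(f"tile {dtype}{list(shape)}")
    for dtype, size in DTYPE_BYTES.items():
        for name, offset in COLUMN_OFFSETS.items():
            if not is_aligned(offset * size):
                bad.append(f"offset {name} ({dtype})")
    return bad


if __name__ == "__main__":
    print(misaligned_items())  # [] for the values above
```

An empty result only says that the static widths and offsets are multiples of 32 bytes. A UB address computed at run time, or one introduced by lowering, could still be misaligned, which is why the disassembly is needed.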

## What we are asking for

1. **Disassemble** `aicore_kernel.o` (140920 bytes, source
   `simpler/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`
   + `simpler/src/a2a3/platform/onboard/aicore/kernel.cpp`) and the model
   kernels generated by PTOAS for the step3p5 reproducer at
   `next_levels/chip_orch/kernels/aiv/full_rope_kv_cache.cpp` (per the
   2026-06-15 update at the top, the kernel of interest is now
   `full_head_gate.cpp` in the same directory). Find the VEC instruction at
   the plog `pc current` (`0x12c0c00d9d9c`, i.e. `addr1` + `0x80c00d9d0c`).
2. **Confirm** whether the offending VEC originates in (a) the simpler
   executor, (b) a PTOAS-injected prologue/epilogue, or (c) the dispatched
   model kernel body.
3. If (c), tell us which IR pattern lowered to it — we will then either
   refactor step3p5 to avoid that pattern or push a fix into PTOAS.
4. If (a) or (b), the fix belongs upstream.

## Related local artifacts (available on request)

- Full plog: `/tmp/p15_devlog3/debug/plog/plog-*.log` (multiple runs across
  this session)
- Compiled chip orchestrator: `/tmp/p15_npu_d0/DecodeLayerDense_*/next_levels/chip_orch/`
- Working tree diffs (this session's eight Phase A refactors): under
  `workspace/pypto-lib/models/step3p5/{attention_full,attention_swa,decode_layer}.py`
- Session diagnosis log: `docs/step3p5/phases/15-singlerank-npu.md` — read
  "Phase A execution status (2026-06-11, end of session)" and
  "Simpler-runtime kernel identification (2026-06-11, follow-up)"

## See also

- `docs/upstream-issues/pypto-1702-followup.md` — earlier hypothesis (now
  ruled out) that PR#1718 should fix this fault; filed as pypto#1738 on
  2026-06-10
- `docs/upstream-issues/simpler-comm-init-segfault.md` — separate
  `comm_init` segfault, fixed via `--no-as-needed` link patch
  ([simpler#1018](https://github.com/hw-native-sys/simpler/issues/1018))
- `docs/upstream-issues/step3p5-multirank-shmem-exbus.md` — Phase 16 driver
  capability gap (filed jointly with this issue)

---

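## Status summary (latest as of 2026-06-15)

### One-sentence summary

step3p5 single-card decode bring-up is blocked by error 507018, caused by
AICore `errcode 0x800 "VEC UB not aligned"`. **The latest bisect (a
`P15_DISPATCH_LIMIT` ladder) pins the fault to chip_orch task 11 =
`full_head_gate` (AIV), not `full_rope_kv_cache` as first assumed.** On the
same chip and the same CANN, `qwen3/32b` decode runs end-to-end in 20
seconds, which shows the chip, driver, and runtime are healthy; the problem
is that step3p5 triggers a misaligned VEC instruction in the specific
`full_head_gate` kernel.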

### Reproduction

```bash
# in a venv with pypto + simpler + pto-isa + ptoas installed
cd <pypto-lib>
python -m models.step3p5.step3p5_decode -p a2a3 -d 0 --no-smoke --dummy-weights
# expected: the chip crashes ~22 ms after the first kernel dispatch
# host: aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
# plog: errcode 0x800 errorStr: "The UB address accessed by the VEC instruction is not aligned"
#       fault kernel_name=aicore_kernel_0_mix_aic, hash=15033215677169261682, binSize=140920
```

`binSize=140920` bytes is exactly the size of the simpler runtime's own
polling-dispatch executor
`simpler/build/lib/a2a3/onboard/tensormap_and_ringbuffer/aicore_kernel.o`,
so the "fault kernel" is simpler's dispatch trampoline; the VEC instruction
that actually faults is in the dispatched `full_head_gate` body.

### Counter-example: qwen3/32b passes in the same environment

```bash
python -m models.qwen3.32b.qwen3_32b_decode -p a2a3 -d 0
# [RUN] PASS (20.15s)   'out' PASS  shape=(16, 8192) dtype=torch.bfloat16
```

Same chip / CANN / simpler / pto-isa / PTOAS / Python. The chip and runtime
are fine; the fault is triggered by an IR pattern specific to step3p5.

### Localisation: dispatch-limit bisect

Commenting out the `rt_submit_*_task(K, ...)` calls later than a given K in
the generated `chip_orch.cpp` and then rebuilding the .so shows that
`LIMIT=11` is the exact PASS↔FAIL boundary. It corresponds to a single added
call, `rt_submit_aiv_task(11, params_t11)` = `full_head_gate` (the per-rank
head-wise sigmoid gate; source at
`pypto-lib/models/step3p5/attention_full.py:564-586`, outer
`pl.spmd(BATCH // BATCH_TILE)` + inner `pl.range(NUM_HEADS_FULL_LOCAL)`).

Trace harness: `pypto-lib/tools/p15_trace/run_with_trace.py`; the
`P15_DISPATCH_LIMIT` environment variable sets the dispatch limit. The
harness sits at the `compile_single_orchestration` hook, so **no
simpler-runtime rebuild is required**.
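The source transformation behind the bisect can be sketched as follows. This is not the code of `run_with_trace.py`; it is a minimal illustration that assumes each `rt_submit_*_task(K, ...)` call sits on a single line with a literal integer as its first argument. The function name `apply_dispatch_limit`, the `[dispatch-limit]` marker, and the sample lines (including the trailing semicolons) are hypothetical.

```python
import re

# Assumption: one submit call per line, first argument is a literal task index.
_SUBMIT = re.compile(r"^(\s*)(rt_submit_\w+_task\(\s*(\d+)\s*,.*)$")


def apply_dispatch_limit(source: str, limit: int) -> str:
    """Comment out every submit call whose task index is greater than limit.

    Tasks 0..limit stay active, matching the PASS/FAIL table in this
    report where LIMIT=11 is the first setting that includes task 11.
    """
    out = []
    for line in source.splitlines():
        match = _SUBMIT.match(line)
        if match and int(match.group(3)) > limit:
            line = f"{match.group(1)}// [dispatch-limit] {match.group(2)}"
        out.append(line)
    return "\n".join(out) + "\n"


SAMPLE = """\
    rt_submit_aiv_task(10, params_t10);
    rt_submit_aiv_task(11, params_t11);
    rt_submit_aic_task(12, params_t12);
"""

if __name__ == "__main__":
    print(apply_dispatch_limit(SAMPLE, 10), end="")
```

With `limit=10` the calls for tasks 11 and 12 are commented out and task 10 is left untouched; with a limit at or above the highest index the text is returned unchanged.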

### Ruled-out paths (time-consuming but conclusive)

| Suspect | Status (✗ = ruled out) | Evidence |
|---|---|---|
| Mixed AIC+AIV dispatch | ✗ | Phase A split the fa block into 4 spmds, split out_proj into cube/cast, and split the dense MLP; `chip_orch.cpp` no longer contains `MixedKernels`, hash unchanged |
| pypto#1693 / PR#1718 (multi-output spmd SSA aliasing) | ✗ | PR#1718 has been ff-merged into pypto main; the fault does not move |
| CANN 8.5.1 vs 9.0.0-beta.1 | ✗ | Crashes on both CANN versions; `libhccl.so` load verified live |
| Python 3.10 vs 3.11 / PTOAS v0.43 vs v0.44 | ✗ | Crashes on every combination |
| SDMA workspace AICPU 0x2a cascade | ✗ | `nm -D libhost_runtime.so` shows no `SdmaWorkspaceManager` symbols; in plog the AICPU 0x2a appears 1900 ms after the AICore 0x800, so it is a cascade, not a cause |
| `pos = ctx_len-1` underflow on dummy `seq_lens=0` | ✗ | `step3p5_decode.py:552` already sets `seq_lens=ones`, so `pos=0` is always in-bounds |
| `pl.parallel(dynamic)` vs `pl.parallel(static)` | ✗ | Switched to static `pl.parallel(BATCH=16)` with the `b_safe` clamp; hash unchanged |
| `full_rope_kv_cache` kernel body | ✗ | A byte-wise diff of the rope kernel body against a PASS reference shows no difference, so the bug is not in the rope kernel itself |

### What we would like upstream to do

1. **Obtain a disassembly of `aicore_kernel_0_mix_aic`** (140920 bytes, source
   `simpler/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`
   + `simpler/src/a2a3/platform/onboard/aicore/kernel.cpp`) and of the
   `next_levels/chip_orch/kernels/aiv/full_head_gate.cpp` kernel that PTOAS
   builds for the step3p5 reproducer. Use the plog `pc current` to find the
   misaligned VEC instruction.
2. **Tell us where the fault originates**: (a) the simpler executor itself,
   (b) a PTOAS-injected prologue/epilogue, or (c) inside the
   `full_head_gate` body.
3. **If (c)**, tell us which IR pattern triggers it, and we will change the
   step3p5 model code to avoid it.
4. **If (a) or (b)**, the fix belongs upstream.

### Reference artifacts (available on request)

- Full plog (multiple failing runs): `/tmp/p15_devlog3/debug/plog/plog-*.log`
- Build output: `/tmp/p15_npu_d0/DecodeLayerDense_*/next_levels/chip_orch/`
- Working-tree code (including the 8 Phase A split attempts):
  `pypto-lib/models/step3p5/{attention_full,attention_swa,decode_layer,prefill_attention_full,prefill_attention_swa}.py`
  (pushed to `csy0225/pypto-lib:feat/step3p5-phase-a-split-scope`;
  draft PR opened as `hw-native-sys/pypto-lib#510`)
- Local diagnosis notes: `docs/step3p5/phases/15-singlerank-npu.md`; see in
  particular the "Phase A execution status (2026-06-11, end of session)" and
  "Simpler-runtime kernel identification (2026-06-11, follow-up)" sections

