[Bug] AICore VEC UB-not-aligned (507018) on step3p5 single-rank, while qwen3/32b passes on the same chip
Repository: hw-native-sys/simpler (primary) — likely also touches
hw-native-sys/pypto or hw-native-sys/ptoas codegen; refile to the right
component on triage.
Filed as: simpler#1036 (2026-06-12)
Severity: blocking — Phase 15 single-rank NPU bring-up cannot proceed,
and any TP=N follow-up will hit the same fault per-rank.
2026-06-15 update — kernel pinned to full_head_gate (task 11), not
full_rope_kv_cache. A subsequent P15_DISPATCH_LIMIT bisect that
comments out late rt_submit_*_task(K, …) calls in the generated
chip_orch.cpp before the .so build flipped the result at exactly
LIMIT=11:
| P15_DISPATCH_LIMIT | dispatched tasks | result |
|---|---|---|
| 6 | 0..6 (rmsnorm → rope) | PASS |
| 10 | 0..10 (incl. fa_fused 4-spmd block, no head_gate) | PASS |
| 11 | + full_head_gate | FAIL 507018 |
| 12, 14, 22 | progressively more | FAIL (same signature) |
The tslot:6 field in plog is an FFTS+ internal slot, NOT the
chip_orch task index — the apparent match to "task 6 =
full_rope_kv_cache" was a misread that this issue's first version
propagated. The actual culprit is dispatched at chip_orch task 11 (AIV)
= the per-rank head-wise sigmoid gate. This also matches the local
TASK-30 full_head_gate AIV0 stall entry in our backlog (CLAUDE.md);
507018 single-rank and TASK-30 are the same bug.
Reproducer pinned more precisely: pypto-lib/tools/p15_trace/run_with_trace.py
with P15_DISPATCH_LIMIT=10 PASSes, P15_DISPATCH_LIMIT=11 FAILs. The
trace harness sits at the compile_single_orchestration chokepoint;
no simpler-runtime rebuild required.
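The bisect step can be sketched as a small source filter. This is a minimal illustration, not the contents of run_with_trace.py: the function name apply_dispatch_limit, the regular expression, and the assumption that each rt_submit_*_task(K, ...) call sits on a single line are all hypothetical.

```python
import re

# Hypothetical pattern: one rt_submit_<kind>_task(K, ...) call per source line.
SUBMIT_CALL = re.compile(r"^(\s*)(rt_submit_\w+_task\(\s*(\d+)\s*,.*)$")


def apply_dispatch_limit(source: str, limit: int) -> str:
    """Comment out every submit call whose task index K is greater than limit."""
    patched = []
    for line in source.splitlines():
        match = SUBMIT_CALL.match(line)
        if match and int(match.group(3)) > limit:
            indent, call = match.group(1), match.group(2)
            patched.append(f"{indent}// disabled by dispatch limit: {call}")
        else:
            patched.append(line)
    return "\n".join(patched)


# Toy stand-in for the generated chip_orch.cpp (not the real file).
sample = "\n".join([
    "    rt_submit_aic_task(10, params_t10);",
    "    rt_submit_aiv_task(11, params_t11);",
    "    rt_submit_aiv_task(12, params_t12);",
])

print(apply_dispatch_limit(sample, limit=10))
```

With limit=10 the calls for tasks 11 and 12 are commented out; raising the limit to 11 re-enables exactly one call, which is what lets a single PASS/FAIL flip identify one task.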
What this changes for the maintainer ask below: the disassembly
request is now scoped to the dispatched function for full_head_gate
(AIV, source pypto-lib/models/step3p5/attention_full.py:564-586 —
outer pl.spmd(BATCH // BATCH_TILE) + pl.range(NUM_HEADS_FULL_LOCAL)
head-wise sigmoid then assemble into attn_out_gated), not the rope
body. The rest of this issue's diagnosis (executor binary identification,
qwen3 counter-example, version pin matrix, what we tried) remains valid.
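For checking device output against a host reference once the kernel runs, a head-wise sigmoid gate can be sketched as below. The exact semantics of full_head_gate are not stated in this issue: the sketch assumes one scalar gate logit per (batch, head) that scales that head's output vector. The name head_gate_reference and the nested-list layout are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def head_gate_reference(attn_out, gate_logits):
    """attn_out: [batch][head][head_dim] floats; gate_logits: [batch][head] floats.

    Assumed semantics: gated[b][h][:] = sigmoid(gate_logits[b][h]) * attn_out[b][h][:].
    """
    gated = []
    for b, heads in enumerate(attn_out):      # mirrors the outer batch loop
        row = []
        for h, vec in enumerate(heads):       # mirrors the inner per-head loop
            g = sigmoid(gate_logits[b][h])
            row.append([g * v for v in vec])
        gated.append(row)
    return gated


# A gate logit of 0 gives sigmoid = 0.5, so the head output is halved.
print(head_gate_reference([[[2.0, 4.0]]], [[0.0]]))
```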
Supersedes the operating hypothesis in pypto-1702-followup.md (filed
2026-06-10 as pypto#1738 = "PR#1718 doesn't fix 507018 — SSA aliasing in
another path"). After this session's eight Phase A model-side mitigations
all failed to shift the fault hash, the SSA-aliasing theory is now ruled
out. The crashing kernel binary is shown below to be simpler runtime's
own AICore polling-dispatch executor (binSize 140920 = exact match for
simpler/build/lib/a2a3/onboard/tensormap_and_ringbuffer/aicore_kernel.o),
so the fix path moves out of pypto codegen and into simpler/PTOAS.
Summary
Running step3p5's step3p5_decode -p a2a3 -d 0 --no-smoke --dummy-weights
crashes the chip ~22 ms after the first kernel dispatch with:
- Host: aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
- Plog: errcode 0x800 errorStr: The UB address accessed by the VEC instruction is not aligned. ... subErrType:4, tslot:6
  fault kernel_name=aicore_kernel_0_mix_aic, hash=15033215677169261682, binSize=140920
That binSize=140920 matches exactly the simpler runtime's compiled
build/lib/a2a3/onboard/tensormap_and_ringbuffer/aicore_kernel.o (140920
bytes). The "fault kernel" is the simpler-runtime polling-dispatch executor
itself; the genuinely faulting VEC instruction lives in a function dispatched
via payload->function_bin_addr from execute_task (FFTS+ MIX stream,
tslot:6). CANN's PrintErrorInfoForDavinciTask reports the entry-point
binary, not the dispatched function, so the executor hash is constant
across all model-side mitigations we tried.
A counter-example in the same repo passes cleanly on the same chip / CANN /
simpler / pto-isa / PTOAS: see "Counter-example" below.
Reproducer
Single command from a --no-smoke --dummy-weights invocation against the
current pypto-lib/models/step3p5/ working tree:
# venv with pypto + simpler + pto-isa + ptoas installed
cd /path/to/pypto-lib
python -m models.step3p5.step3p5_decode \
-p a2a3 -d 0 --no-smoke --dummy-weights
Observed (truncated):
[chip_process pid=… dev=0] ready
[ERROR] sync_run_streams: [device_runner_base.cpp:877]
aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
[ERROR] recover_device_or_mark_unusable: [device_runner.cpp:456]
Device unrecoverable after AICore error 507018:
aclrtSynchronizeDeviceWithTimeout failed: 507015.
RuntimeError: WorkerThread::dispatch_process: child failed (code=1):
chip_process dev=0: RuntimeError:
run_prepared failed with code 507018
Plog (from ASCEND_PROCESS_LOG_PATH=$LOGDIR then find $LOGDIR -name 'plog-*.log'):
AllKernelRegister: Runtime_alloc_size 1240, type=0,
kernel_name=aicore_kernel_0_mix_aic, tilingkey=0, offset=144,
length=232, dfxAddr=0x0, dfxSize=0, kernelVfType=0, shareMemSize=0.
LaunchKernelWithHandle: kernel info : device_id=0, stream_id=45, task_id=0,
kernelType=0, kernel_name=aicore_kernel_0_mix_aic,
arg_size=8, mixType=3, taskRation=2, funcType=0,
addr1=0x124000000090, addr2=0x12400000076c, flag=0, kernelFlag=0x0,
qos=0, partId=0, schemMode=1, infoAddr=(nil), atomicIndex=0.
FillFftsMixSqeForDavinciTask: kernelNames_=aicore_kernel_0_mix_aic,
stackSize=32768.
PrintCoreInfo: The extend info: errcode:(0, 0x800, 0)
errorStr: The UB address accessed by the VEC instruction is not aligned.
fixp_error0 info: 0xcc175f7, fixp_error1 info: 0x8e,
fsmId:0, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.
PrintErrorInfoForDavinciTask: Aicore kernel execute failed, device_id=0,
stream_id=45, report_stream_id=45, task_id=0, flip_num=0,
fault kernel_name=aicore_kernel_0_mix_aic, fault kernel info ext=none,
program id=0, hash=15033215677169261682.
GetBinAndKernelNameExceptionArgs: kernel_name=aicore_kernel_0_mix_aic,
kernelNameSize=23, binSize=140920.
A subsequent PrintAicpuErrorInfo: funcName=simpler_aicpu_exec, errorCode=0x2a appears ~1900 ms later; this is a cascade from the
chip's fault state, not a primary cause (verified by timestamp ordering).
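The fault signature used throughout this report (errcode, tslot, subErrType, hash, binSize) can be pulled out of a plog with a few regular expressions. A minimal sketch, assuming the field spellings shown in the excerpt above; the helper name fault_signature is hypothetical and the patterns are not an official CANN log schema.

```python
import re

# Patterns follow the field spellings in the plog excerpt; they are assumptions.
PATTERNS = {
    "errcode": re.compile(r"errcode:\(\d+,\s*(0x[0-9a-fA-F]+),"),
    "tslot": re.compile(r"tslot:(\d+)"),
    "sub_err_type": re.compile(r"subErrType:(\d+)"),
    "hash": re.compile(r"hash=(\d+)"),
    "bin_size": re.compile(r"binSize=(\d+)"),
}


def fault_signature(plog_text: str) -> dict:
    """Return the first match of each field as a string, or None if absent."""
    signature = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(plog_text)
        signature[name] = match.group(1) if match else None
    return signature


excerpt = """
PrintCoreInfo: The extend info: errcode:(0, 0x800, 0)
fsmId:0, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.
program id=0, hash=15033215677169261682.
kernelNameSize=23, binSize=140920.
"""

print(fault_signature(excerpt))
```

Comparing the returned dictionaries across runs shows whether a mitigation moved the fault, and the bin_size value can be compared against os.path.getsize() of a candidate object file.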
Counter-example
The qwen3-32b decode reference in the same pypto-lib, with the same
single-card harness, passes end-to-end in ~20 seconds:
cd /path/to/pypto-lib
python -m models.qwen3.32b.qwen3_32b_decode -p a2a3 -d 0
# [RUN] PASS (20.15s)
# [RUN] 'out' PASS shape=(16, 8192) dtype=torch.bfloat16
Same chip / CANN / simpler / pto-isa / PTOAS / Python. The chip + runtime
stack itself is healthy. step3p5 specifically triggers the fault.
What is at tslot:6 in step3p5
Reading the compiled next_levels/chip_orch/orchestration/chip_orch.cpp,
task dispatch index 6 is full_rope_kv_cache — a per-batch AIV scope that
applies partial RoPE (rotary_dim=64 of head_dim=128, pass-through tail=64)
and writes K/V cache + a padded Q block. The relevant Python source is
pypto-lib/models/step3p5/attention_full.py:387-484 (scope 2).
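As a host-side reference for the shape of this computation, partial RoPE on one head vector can be sketched as follows. The pairing convention is an assumption: the sketch uses the half-split form (element i rotates with element i + 32), which this issue does not state. The function name partial_rope and the list-based layout are hypothetical.

```python
HEAD_DIM = 128
ROTARY_DIM = 64
ROTARY_HALF = ROTARY_DIM // 2  # 32


def partial_rope(x, cos, sin):
    """x: HEAD_DIM floats; cos, sin: ROTARY_HALF floats for one position.

    Rotates the first ROTARY_DIM elements (half-split pairing, an assumption)
    and passes the remaining HEAD_DIM - ROTARY_DIM elements through unchanged.
    """
    assert len(x) == HEAD_DIM
    assert len(cos) == ROTARY_HALF and len(sin) == ROTARY_HALF
    lo = x[:ROTARY_HALF]
    hi = x[ROTARY_HALF:ROTARY_DIM]
    tail = x[ROTARY_DIM:]  # pass-through tail, 64 elements
    rot_lo = [a * c - b * s for a, b, c, s in zip(lo, hi, cos, sin)]
    rot_hi = [b * c + a * s for a, b, c, s in zip(lo, hi, cos, sin)]
    return rot_lo + rot_hi + tail


x = [float(i) for i in range(HEAD_DIM)]
identity = partial_rope(x, [1.0] * ROTARY_HALF, [0.0] * ROTARY_HALF)
print(identity == x)  # a zero rotation angle leaves the vector unchanged
```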
Mapping addr to the fault PC
addr1=0x124000000090 and addr2=0x12400000076c are the chip-side virtual
addresses of the executor entry. Plog PrintCoreInfo reports
pc current: 0x12c0c00d9d9c, which is ~0x80c00d9d0c bytes past addr1 (low 32 bits of the difference: 0xc00d9d0c). The
executor binary is only 140920 bytes (= 0x22678), so the PC is not in
the executor — it is in a dispatched function loaded elsewhere in chip
address space (function_bin_addr from PTO2DispatchPayload). We were
unable to disassemble device-side binaries from inside the container (no
gdb / objdump for AICore), so we cannot localise the misaligned VEC at the
instruction level. Asking the maintainer team to do this localisation is
the primary purpose of this report.
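The address arithmetic can be reproduced in a few lines. The sketch below uses only the values quoted from the plog and treats addr1 as the approximate start of the executor image, which is an assumption. The subtraction keeps all address bits, so the full offset is 0x80c00d9d0c and its low 32 bits are 0xc00d9d0c.

```python
ADDR1 = 0x124000000090       # executor entry (addr1 from LaunchKernelWithHandle)
PC_CURRENT = 0x12C0C00D9D9C  # "pc current" from PrintCoreInfo
EXECUTOR_SIZE = 140920       # binSize of aicore_kernel.o, in bytes

offset = PC_CURRENT - ADDR1
# Assumption: the executor image occupies [ADDR1, ADDR1 + EXECUTOR_SIZE).
in_executor = 0 <= offset < EXECUTOR_SIZE

print(f"offset from addr1:        {offset:#x}")
print(f"low 32 bits of offset:    {offset & 0xFFFFFFFF:#x}")
print(f"executor size:            {EXECUTOR_SIZE:#x}")
print(f"PC inside executor image: {in_executor}")
```

The offset is several orders of magnitude larger than the executor image, so the conclusion that the PC lies in a separately loaded dispatched function does not depend on the exact image base.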
Version pin table
| Component | HEAD | Notes |
|---|---|---|
| Chip | Ascend 910B2C (Short_SoC_version=Ascend910B) | dav-c220-cube / dav-c220-vec, 24 AIC / 48 AIV / 6 AICPU per die |
| Driver | npu-smi 25.5.1 | |
| CANN | 9.0.0-beta.1 | Also reproduced on 8.5.1 (verified live libhccl.so load) |
| Python | 3.11.14 | Reproduced on 3.10 too |
| simpler | branch fix/tensor-zero-size-view-bounds:0cd317e7 (= PR #1023 plus host-side --no-as-needed link patch + comm_hccl.cpp P2P best-effort) | Also reproduces on main:afb5c5a9 |
| pto-isa | main:109c9f72 | |
| PTOAS | binary v0.44 (source main:29a8af28) | Also reproduces on v0.43 |
| pypto | main:0f4881cb (post PR#1718 merge) | |
| pypto-lib | main:9c5593fb + step3p5 working-tree WIP | qwen3/32b passes against the same SHA |
The constancy of hash=15033215677169261682, binSize=140920, addr1=0x124000000090, addr2=0x12400000076c, tslot:6 across all eight
model-side mitigations below (which change every named step3p5 kernel) is
decisive evidence that the faulting kernel is not in step3p5 Python
source — it is in code reachable through simpler's polling-dispatch
executor.
What we tried (all leave the failure unchanged)
1. Phase A split-scope refactor of full_fa_fused into 4 sequential spmds (full_qk_matmul AIC → full_softmax AIV → full_sv_matmul AIC → full_online_softmax AIV), mirroring qwen3/32b's pattern. Eliminates the mixed AIC+AIV dispatch entirely. chip_orch.cpp verified: no remaining MixedKernels groups. → same hash, same tslot:6.
2. Split full_out_proj into pure-cube matmul + pure-vec cast via FP32 GM scratch handoff. → same hash.
3. Split dense_gate_up_silu_tp and dense_down_proj_tp the same way. → same hash.
4. SWA mirror of (1)+(2) in attention_swa.py. → same hash.
5. Full-row cast + overwrite RoPE idiom for full_rope_kv_cache K and Q writes (qwen3/32b uses this; replaces a pl.add(k_pass, 0.0) workaround that was previously masking a different compile-time error). → same hash.
6. Rewrite full_rmsnorm_zc from pl.spmd(BATCH//BATCH_TILE=1) + pl.range to qwen3/32b's pl.at(level=pl.Level.CORE_GROUP) + pl.pipeline(stage=4) form. → same hash.
7. pl.parallel(user_batch) → pl.parallel(BATCH) (dynamic loop bound replaced with static Python constant). Tested with defensive b_safe = pl.min(b, user_batch-1) clamp. → same hash.
8. TP=1 monkey-patch path (--tp-world-size 1), which takes a different code path in step3p5_decode.py:351-381. → same hash.
All eight changes are compile-clean (smoke probe rc=0) and structurally
match the working qwen3/32b form in the same repo. None move the fault.
Rule-out matrix
| Suspect | Outcome | Evidence |
|---|---|---|
| Mixed AIC+AIV dispatch | Ruled out | All MixedKernels removed via (1)–(4); chip_orch.cpp has only rt_submit_aic_task / rt_submit_aiv_task |
| pypto#1693 / PR#1718 (multi-output spmd SSA aliasing) | Ruled out | PR #1718 merged on pod (pypto:0f4881cb); no effect on this fault |
| CANN version | Ruled out | Reproduces on 8.5.1 and 9.0.0-beta.1 |
| Python version | Ruled out | Reproduces on 3.10 and 3.11 |
| PTOAS version | Ruled out | Reproduces on v0.43 and v0.44 |
| SDMA workspace (aclnnShmemSdmaStarsQuery) AICPU 0x2a | Ruled out | SIMPLER_ENABLE_PTO_SDMA_WORKSPACE=OFF already in effect; nm -D libhost_runtime.so returns zero SdmaWorkspaceManager symbols; AICPU 0x2a in the log is a cascade ~1900 ms after the AICore 0x800 |
| TP=8 canonical vs TP=1 monkey-patch | Ruled out | Same hash on both code paths |
| Driver support_shmem_map_exbus=0 | Ruled out for single-rank | This driver flag affects multi-rank aclrtIpcMemImportByKey (filed separately); single-rank reproducer here uses no cross-card IPC |
| Dummy input out-of-bounds (pos = ctx_len-1 underflow when seq_lens=0) | Ruled out | Current step3p5_decode.py:552 sets seq_lens = torch.ones(...) so pos = 0 is always in-bounds; slot_mapping = torch.arange(B) gives unique slots |
| pl.parallel(dynamic) vs pl.parallel(static) loop bound | Ruled out | (7) above |
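The dummy-input row can be checked on the host without a device. A minimal sketch using plain Python lists in place of the torch tensors; the helper name decode_positions is hypothetical, and BATCH = 16 follows the static pl.parallel(BATCH=16) bound mentioned in this issue.

```python
BATCH = 16


def decode_positions(seq_lens):
    """pos = ctx_len - 1 for each request, as described in the matrix above."""
    return [ctx_len - 1 for ctx_len in seq_lens]


seq_lens = [1] * BATCH             # plain-list stand-in for torch.ones(...)
slot_mapping = list(range(BATCH))  # plain-list stand-in for torch.arange(B)

positions = decode_positions(seq_lens)
print("all positions in bounds:", all(p >= 0 for p in positions))
print("slots unique:", len(set(slot_mapping)) == len(slot_mapping))
# The underflow this row rules out: seq_lens = 0 would give pos = -1.
print("pos with seq_lens=0:", decode_positions([0] * BATCH)[0])
```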
What we believe but cannot verify locally
The faulting VEC instruction is inside a model-kernel dispatched by
simpler's executor at the 7th FFTS+ MIX SQE slot. Strong candidate based on
the dispatch sequence we read out of chip_orch.cpp:
full_rope_kv_cache (per-batch loop, AIV) — its position in dispatch
order matches tslot:6 directly.
The TileType::Vec declarations we read out of the generated
full_rope_kv_cache.cpp all use 32-byte-aligned widths (float[1,32],
bfloat16[1,32], float[8,32], etc.). We did not find a structural
alignment violation by inspection. The pattern that differs from qwen3/32b
and is unique to step3p5 is partial RoPE (ROTARY_HALF_FULL=32,
rotary_dim=64, pass-through=64) versus full RoPE (rotary_dim=128). Whether
PTOAS lowers the partial-RoPE pattern into an unaligned VEC is what we'd
like the codegen team to confirm.
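The inspection of tile widths described above amounts to a row-size check, sketched here. The 32-byte granularity is taken from the wording of this report and is otherwise an assumption; the helper names and the dtype table are hypothetical.

```python
DTYPE_BYTES = {"float": 4, "bfloat16": 2}
UB_ALIGN = 32  # assumed UB alignment granularity, in bytes


def row_bytes(dtype, shape):
    """Bytes in one row of a tile: innermost dimension times element size."""
    return shape[-1] * DTYPE_BYTES[dtype]


def is_aligned(dtype, shape, align=UB_ALIGN):
    return row_bytes(dtype, shape) % align == 0


# Tile declarations quoted from the generated full_rope_kv_cache.cpp.
tiles = [("float", (1, 32)), ("bfloat16", (1, 32)), ("float", (8, 32))]
for dtype, shape in tiles:
    print(dtype, shape, row_bytes(dtype, shape), is_aligned(dtype, shape))
```

This is a static check on declared widths only. It says nothing about UB offsets computed at run time (for example per-head or per-batch offsets), which is where a misaligned access could still arise even when every declared tile passes.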
What we are asking for
- Disassemble aicore_kernel.o (140920 bytes; sources simpler/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp and simpler/src/a2a3/platform/onboard/aicore/kernel.cpp) and the model kernels generated by PTOAS for the step3p5 reproducer at next_levels/chip_orch/kernels/aiv/full_rope_kv_cache.cpp (per the 2026-06-15 update, full_head_gate is now the primary target). Find the VEC instruction at pc current 0x12c0c00d9d9c, that is, ~0x80c00d9d0c bytes past the executor entry addr1.
- Confirm whether the offending VEC originates in (a) the simpler executor, (b) a PTOAS-injected prologue/epilogue, or (c) the dispatched model kernel body.
- If (c), tell us which IR pattern lowered to it; we will then either refactor step3p5 to avoid that pattern or push a fix into PTOAS.
- If (a) or (b), the fix belongs upstream.
Related local artifacts (available on request)
- Full plog: /tmp/p15_devlog3/debug/plog/plog-*.log (multiple runs across this session)
- Compiled chip orchestrator: /tmp/p15_npu_d0/DecodeLayerDense_*/next_levels/chip_orch/
- Working tree diffs (this session's eight Phase A refactors): under workspace/pypto-lib/models/step3p5/{attention_full,attention_swa,decode_layer}.py
- Session diagnosis log: docs/step3p5/phases/15-singlerank-npu.md; read "Phase A execution status (2026-06-11, end of session)" and "Simpler-runtime kernel identification (2026-06-11, follow-up)"
See also
- docs/upstream-issues/pypto-1702-followup.md: earlier hypothesis (now ruled out) that PR#1718 should fix this fault; filed as pypto#1738 on 2026-06-10
- docs/upstream-issues/simpler-comm-init-segfault.md: separate comm_init segfault, fixed via --no-as-needed link patch (simpler#1018)
- docs/upstream-issues/step3p5-multirank-shmem-exbus.md: Phase 16 driver capability gap (filed jointly with this issue)
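Chinese summary, translated (latest status as of 2026-06-15)
One-line summary
step3p5 single-card decode bring-up is blocked by 507018, caused by AICore errcode 0x800 "VEC UB not aligned". The latest bisect (the P15_DISPATCH_LIMIT ladder) pins the fault to chip_orch task 11 = full_head_gate (AIV), not full_rope_kv_cache as first assumed. On the same chip and the same CANN, qwen3/32b decode passes end-to-end in 20 seconds, which shows that the chip, driver, and runtime are healthy; the problem is that step3p5 triggers a misaligned VEC instruction in the full_head_gate kernel specifically.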
Reproduction
# In a venv with pypto + simpler + pto-isa + ptoas installed
cd <pypto-lib>
python -m models.step3p5.step3p5_decode -p a2a3 -d 0 --no-smoke --dummy-weights
# Expected: the chip crashes ~22 ms after the first kernel dispatch
# host: aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
# plog: errcode 0x800 errorStr: "The UB address accessed by the VEC instruction is not aligned"
# fault kernel_name=aicore_kernel_0_mix_aic, hash=15033215677169261682, binSize=140920
binSize=140920 bytes exactly equals the size of the simpler runtime's own polling-dispatch executor, simpler/build/lib/a2a3/onboard/tensormap_and_ringbuffer/aicore_kernel.o, so the "fault kernel" is simpler's dispatch trampoline; the VEC instruction that actually faults is in the body of the dispatched full_head_gate.
Counter-example: qwen3/32b passes in the same environment
python -m models.qwen3.32b.qwen3_32b_decode -p a2a3 -d 0
# [RUN] PASS (20.15s) 'out' PASS shape=(16, 8192) dtype=torch.bfloat16
Same chip / CANN / simpler / pto-isa / PTOAS / Python. The chip and runtime are fine; the fault is triggered by an IR pattern specific to step3p5.
Localisation: dispatch-limit bisect
Commenting out the rt_submit_*_task(K, ...) calls later than a given K in the generated chip_orch.cpp and then rebuilding the .so shows that LIMIT=11 is the exact PASS/FAIL boundary. It corresponds to a single added call, rt_submit_aiv_task(11, params_t11) = full_head_gate (per-rank head-wise sigmoid gate; source at pypto-lib/models/step3p5/attention_full.py:564-586, outer pl.spmd(BATCH // BATCH_TILE) + inner pl.range(NUM_HEADS_FULL_LOCAL)).
Trace harness: pypto-lib/tools/p15_trace/run_with_trace.py; the dispatch limit is set through the P15_DISPATCH_LIMIT environment variable. The harness hooks in at compile_single_orchestration, so the simpler runtime does not need to be rebuilt.
Ruled-out paths (time-consuming but conclusive)
| Suspect | Status | Evidence |
|---|---|---|
| Mixed AIC+AIV dispatch | Ruled out | Phase A split the fa block into 4 spmds, split out_proj into cube/cast, and split the dense MLP; chip_orch.cpp no longer contains MixedKernels, and the hash is unchanged |
| pypto#1693 / PR#1718 (multi-output spmd SSA aliasing) | Ruled out | PR#1718 has been ff-merged into pypto main; the fault does not move |
| CANN 8.5.1 vs 9.0.0-beta.1 | Ruled out | Both CANN versions crash; the libhccl.so load was verified live |
| Python 3.10 vs 3.11 / PTOAS v0.43 vs v0.44 | Ruled out | Every combination crashes |
| SDMA workspace AICPU 0x2a cascade | Ruled out | nm -D libhost_runtime.so shows no SdmaWorkspaceManager symbols; in the plog, AICPU 0x2a appears 1900 ms after AICore 0x800, so it is a cascade, not a cause |
| pos = ctx_len-1 underflow on dummy seq_lens=0 | Ruled out | step3p5_decode.py:552 sets seq_lens=ones, so pos=0 is always in-bounds |
| pl.parallel(dynamic) vs pl.parallel(static) | Ruled out | Switched to static pl.parallel(BATCH=16) plus the b_safe clamp; hash unchanged |
| full_rope_kv_cache kernel body | Ruled out | A byte-wise diff of the rope kernel body against a PASS reference shows no difference, so the rope kernel itself is not the bug |
What we would like upstream to do
- Obtain a disassembly of aicore_kernel_0_mix_aic (140920 bytes; sources simpler/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp and simpler/src/a2a3/platform/onboard/aicore/kernel.cpp) and of the kernel PTOAS generates for the step3p5 reproducer at next_levels/chip_orch/kernels/aiv/full_head_gate.cpp. Use the plog's pc current to find the misaligned VEC instruction.
- Tell us where the fault originates: (a) the simpler executor itself, (b) a PTOAS-injected prologue/epilogue, or (c) inside the full_head_gate body.
- If (c), tell us which IR pattern triggers it, and we will change the step3p5 model code to avoid it.
- If (a) or (b), the fix belongs upstream.
Reference artifacts (available on request)
- Full plog (multiple failing runs): /tmp/p15_devlog3/debug/plog/plog-*.log
- Build output: /tmp/p15_npu_d0/DecodeLayerDense_*/next_levels/chip_orch/
- Working-tree code (including the 8 Phase A split attempts): pypto-lib/models/step3p5/{attention_full,attention_swa,decode_layer,prefill_attention_full,prefill_attention_swa}.py (pushed to csy0225/pypto-lib:feat/step3p5-phase-a-split-scope; draft PR opened as hw-native-sys/pypto-lib#510)
- Local diagnosis notes: docs/step3p5/phases/15-singlerank-npu.md; see in particular the sections "Phase A execution status (2026-06-11, end of session)" and "Simpler-runtime kernel identification (2026-06-11, follow-up)"