[Bug] AICore VEC UB-not-aligned (507018) on step3p5 single-rank, while qwen3/32b passes on the same chip #1036

@csy0225

Repository: hw-native-sys/simpler (primary) — likely also touches
hw-native-sys/pypto or hw-native-sys/ptoas codegen; refile to the right
component on triage.

Filed as: simpler#1036 (2026-06-12)

Severity: blocking — Phase 15 single-rank NPU bring-up cannot proceed,
and any TP=N follow-up will hit the same fault per-rank.

2026-06-15 update: kernel pinned to full_head_gate (task 11), not
full_rope_kv_cache. A subsequent P15_DISPATCH_LIMIT bisect, which
comments out late rt_submit_*_task(K, …) calls in the generated
chip_orch.cpp before the .so build, flipped the result at exactly
LIMIT=11:

| P15_DISPATCH_LIMIT | Dispatched tasks | Result |
|---|---|---|
| 6 | 0..6 (rmsnorm → rope) | PASS |
| 10 | 0..10 (incl. fa_fused 4-spmd block, no head_gate) | PASS |
| 11 | + full_head_gate | FAIL 507018 |
| 12, 14, 22 | progressively more | FAIL (same signature) |
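
The bisect step can be sketched in a few lines. This is a minimal illustration, not the actual trace harness: the helper name `apply_dispatch_limit`, the `// [dispatch-limit]` marker, and the assumption that every submit call sits on a single line are all hypothetical.

```python
import re

# Matches single-line calls such as `rt_submit_aiv_task(11, params_t11);` and
# captures the task index. The exact call shape in a real chip_orch.cpp is an
# assumption made for this sketch.
SUBMIT_RE = re.compile(r"^(\s*)(rt_submit_\w+_task\(\s*(\d+)\s*,.*)$")

def apply_dispatch_limit(source: str, limit: int) -> str:
    """Comment out every submit call whose task index is greater than `limit`."""
    out = []
    for line in source.splitlines():
        m = SUBMIT_RE.match(line)
        if m and int(m.group(3)) > limit:
            # Keep the indentation, disable the call.
            out.append(f"{m.group(1)}// [dispatch-limit] {m.group(2)}")
        else:
            out.append(line)
    return "\n".join(out)

# Hypothetical excerpt of a generated orchestrator.
sample = "\n".join([
    "    rt_submit_aiv_task(10, params_t10);",
    "    rt_submit_aiv_task(11, params_t11);",
    "    rt_submit_aic_task(12, params_t12);",
])

limited = apply_dispatch_limit(sample, 10)
print(limited)
```

With `limit=10`, tasks 0..10 stay live and task 11 onward is commented out, which corresponds to the PASS row; raising the limit to 11 re-enables exactly one extra call.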

The tslot:6 field in plog is an FFTS+ internal slot, NOT the
chip_orch task index — the apparent match to "task 6 =
full_rope_kv_cache" was a misread that this issue's first version
propagated. The actual culprit is dispatched at chip_orch task 11 (AIV)
= the per-rank head-wise sigmoid gate. This also matches the local
TASK-30 full_head_gate AIV0 stall entry in our backlog (CLAUDE.md);
507018 single-rank and TASK-30 are the same bug.

Reproducer pinned more precisely: pypto-lib/tools/p15_trace/run_with_trace.py
with P15_DISPATCH_LIMIT=10 PASSes, P15_DISPATCH_LIMIT=11 FAILs. The
trace harness sits at the compile_single_orchestration chokepoint;
no simpler-runtime rebuild required.

What this changes for the maintainer ask below: the disassembly
request is now scoped to the dispatched function for full_head_gate
(AIV, source pypto-lib/models/step3p5/attention_full.py:564-586
outer pl.spmd(BATCH // BATCH_TILE) + pl.range(NUM_HEADS_FULL_LOCAL)
head-wise sigmoid then assemble into attn_out_gated), not the rope
body. The rest of this issue's diagnosis (executor binary identification,
qwen3 counter-example, version pin matrix, what we tried) remains valid.

Supersedes the operating hypothesis in pypto-1702-followup.md (filed
2026-06-10 as pypto#1738 = "PR#1718 doesn't fix 507018 — SSA aliasing in
another path"). After this session's eight Phase A model-side mitigations
all failed to shift the fault hash, the SSA-aliasing theory is now ruled
out. The crashing kernel binary is shown below to be simpler runtime's
own AICore polling-dispatch executor (binSize 140920 = exact match for
simpler/build/lib/a2a3/onboard/tensormap_and_ringbuffer/aicore_kernel.o),
so the fix path moves out of pypto codegen and into simpler/PTOAS.

Summary

Running step3p5's step3p5_decode -p a2a3 -d 0 --no-smoke --dummy-weights
crashes the chip ~22 ms after the first kernel dispatch with:

  • Host: aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
  • Plog: errcode 0x800 errorStr: The UB address accessed by the VEC instruction is not aligned. ... subErrType:4, tslot:6
  • fault kernel_name=aicore_kernel_0_mix_aic, hash=15033215677169261682, binSize=140920

That binSize=140920 matches exactly the simpler runtime's compiled
build/lib/a2a3/onboard/tensormap_and_ringbuffer/aicore_kernel.o (140920
bytes). The "fault kernel" is the simpler-runtime polling-dispatch executor
itself; the genuinely faulting VEC instruction lives in a function dispatched
via payload->function_bin_addr from execute_task (FFTS+ MIX stream,
tslot:6). CANN's PrintErrorInfoForDavinciTask reports the entry-point
binary, not the dispatched function, so the executor hash is constant
across all model-side mitigations we tried.
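
The size comparison above can be repeated mechanically. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and a temporary file of the same length stands in for aicore_kernel.o so the snippet runs outside the pod. A matching size is only suggestive; comparing a content hash would be stronger if the binary can be extracted.

```python
import os
import re
import tempfile
from typing import Optional

def plog_bin_size(plog_text: str) -> Optional[int]:
    """Return the first binSize=<n> value found in the plog text, if any."""
    match = re.search(r"binSize=(\d+)", plog_text)
    return int(match.group(1)) if match else None

def matches_object_file(plog_text: str, object_path: str) -> bool:
    """True when the plog-reported binSize equals the on-disk size of object_path."""
    size = plog_bin_size(plog_text)
    return size is not None and size == os.path.getsize(object_path)

# Line copied from the plog excerpt in this report.
plog_line = ("GetBinAndKernelNameExceptionArgs: kernel_name=aicore_kernel_0_mix_aic, "
             "kernelNameSize=23, binSize=140920.")

# Stand-in for aicore_kernel.o so the sketch runs anywhere; on the pod, pass
# the real object path instead of this temporary file.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(b"\0" * 140920)
    stand_in = fh.name

same_size = matches_object_file(plog_line, stand_in)
os.remove(stand_in)
print(same_size)
```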

A counter-example in the same repo passes cleanly on the same chip / CANN /
simpler / pto-isa / PTOAS: see "Counter-example" below.

Reproducer

Single command from a --no-smoke --dummy-weights invocation against the
current pypto-lib/models/step3p5/ working tree:

# venv with pypto + simpler + pto-isa + ptoas installed
cd /path/to/pypto-lib
python -m models.step3p5.step3p5_decode \
    -p a2a3 -d 0 --no-smoke --dummy-weights

Observed (truncated):

[chip_process pid=… dev=0] ready
[ERROR] sync_run_streams: [device_runner_base.cpp:877]
        aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
[ERROR] recover_device_or_mark_unusable: [device_runner.cpp:456]
        Device unrecoverable after AICore error 507018:
        aclrtSynchronizeDeviceWithTimeout failed: 507015.
RuntimeError: WorkerThread::dispatch_process: child failed (code=1):
              chip_process dev=0: RuntimeError:
              run_prepared failed with code 507018

Plog (from ASCEND_PROCESS_LOG_PATH=$LOGDIR then find $LOGDIR -name 'plog-*.log'):

AllKernelRegister: Runtime_alloc_size 1240, type=0,
    kernel_name=aicore_kernel_0_mix_aic, tilingkey=0, offset=144,
    length=232, dfxAddr=0x0, dfxSize=0, kernelVfType=0, shareMemSize=0.
LaunchKernelWithHandle: kernel info : device_id=0, stream_id=45, task_id=0,
    kernelType=0, kernel_name=aicore_kernel_0_mix_aic,
    arg_size=8, mixType=3, taskRation=2, funcType=0,
    addr1=0x124000000090, addr2=0x12400000076c, flag=0, kernelFlag=0x0,
    qos=0, partId=0, schemMode=1, infoAddr=(nil), atomicIndex=0.
FillFftsMixSqeForDavinciTask: kernelNames_=aicore_kernel_0_mix_aic,
    stackSize=32768.
PrintCoreInfo: The extend info: errcode:(0, 0x800, 0)
    errorStr: The UB address accessed by the VEC instruction is not aligned.
    fixp_error0 info: 0xcc175f7, fixp_error1 info: 0x8e,
    fsmId:0, tslot:6, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.
PrintErrorInfoForDavinciTask: Aicore kernel execute failed, device_id=0,
    stream_id=45, report_stream_id=45, task_id=0, flip_num=0,
    fault kernel_name=aicore_kernel_0_mix_aic, fault kernel info ext=none,
    program id=0, hash=15033215677169261682.
GetBinAndKernelNameExceptionArgs: kernel_name=aicore_kernel_0_mix_aic,
    kernelNameSize=23, binSize=140920.

A subsequent PrintAicpuErrorInfo: funcName=simpler_aicpu_exec, errorCode=0x2a appears ~1900 ms later; this is a cascade from the
chip's fault state, not a primary cause (verified by timestamp ordering).

Counter-example

The qwen3-32b decode reference in the same pypto-lib, with the same
single-card harness, passes end-to-end in ~20 seconds:

cd /path/to/pypto-lib
python -m models.qwen3.32b.qwen3_32b_decode -p a2a3 -d 0
# [RUN] PASS (20.15s)
# [RUN]   'out' PASS  shape=(16, 8192) dtype=torch.bfloat16

Same chip / CANN / simpler / pto-isa / PTOAS / Python. The chip + runtime
stack itself is healthy. step3p5 specifically triggers the fault.

What is at tslot:6 in step3p5

Reading the compiled next_levels/chip_orch/orchestration/chip_orch.cpp,
task dispatch index 6 is full_rope_kv_cache — a per-batch AIV scope that
applies partial RoPE (rotary_dim=64 of head_dim=128, pass-through tail=64)
and writes K/V cache + a padded Q block. The relevant Python source is
pypto-lib/models/step3p5/attention_full.py:387-484 (scope 2).

Mapping addr to the fault PC

addr1=0x124000000090 and addr2=0x12400000076c are the chip-side virtual
addresses of the executor entry. Plog PrintCoreInfo reports
pc current: 0x12c0c00d9d9c, which is ~0x80c00d9d0c bytes past addr1. The
executor binary is only 140920 bytes (= 0x22678), so the PC is not in
the executor; it is in a dispatched function loaded elsewhere in chip
address space (function_bin_addr from PTO2DispatchPayload). We were
unable to disassemble device-side binaries from inside the container (no
gdb / objdump for AICore), so we cannot localise the misaligned VEC at the
instruction level. Asking the maintainer team to do this localisation is
the primary purpose of this report.
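
The address arithmetic can be checked directly. The values below are copied from the plog; treating addr1 as the start of the executor image is a simplifying assumption of this sketch (the registered kernel offset is 144 bytes, so the true load base may sit slightly lower, which does not change the conclusion).

```python
# Values copied from the plog excerpt in this report.
ADDR1 = 0x124000000090       # executor entry (addr1)
PC_CURRENT = 0x12C0C00D9D9C  # faulting PC from PrintCoreInfo
BIN_SIZE = 140920            # executor binary size in bytes

def pc_in_binary(pc: int, base: int, size: int) -> bool:
    """True when pc falls inside the half-open range [base, base + size)."""
    return base <= pc < base + size

offset = PC_CURRENT - ADDR1

print(hex(BIN_SIZE))                              # 0x22678
print(hex(offset))                                # 0x80c00d9d0c
print(pc_in_binary(PC_CURRENT, ADDR1, BIN_SIZE))  # False
```

The offset is several orders of magnitude larger than the executor image, which is why the faulting PC has to belong to a separately loaded function.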

Version pin table

| Component | HEAD | Notes |
|---|---|---|
| Chip | Ascend 910B2C (Short_SoC_version=Ascend910B) | dav-c220-cube / dav-c220-vec, 24 AIC / 48 AIV / 6 AICPU per die |
| Driver | npu-smi 25.5.1 | |
| CANN | 9.0.0-beta.1 | Also reproduced on 8.5.1 (verified live libhccl.so load) |
| Python | 3.11.14 | Reproduced on 3.10 too |
| simpler | branch fix/tensor-zero-size-view-bounds:0cd317e7 (= PR #1023 plus host-side --no-as-needed link patch + comm_hccl.cpp P2P best-effort) | Also reproduces on main:afb5c5a9 |
| pto-isa | main:109c9f72 | |
| PTOAS | binary v0.44 (source main:29a8af28) | Also reproduces on v0.43 |
| pypto | main:0f4881cb (post PR#1718 merge) | |
| pypto-lib | main:9c5593fb + step3p5 working-tree WIP | qwen3/32b passes against the same SHA |

The constancy of hash=15033215677169261682, binSize=140920,
addr1=0x124000000090, addr2=0x12400000076c, and tslot:6 across all eight
model-side mitigations below (which change every named step3p5 kernel) is
decisive evidence that the faulting kernel is not in step3p5 Python
source; it is in code reachable through simpler's polling-dispatch
executor.
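
One way to make the "signature did not move" comparison repeatable is to extract the same fields from each run's plog and compare them. This is a hedged sketch: the function names are hypothetical, the regular expressions assume the field spellings seen in the excerpt above, and the two sample strings are abbreviated stand-ins for real plog files.

```python
import re

# Field patterns follow the spellings in the plog excerpt of this report.
FIELDS = {
    "hash": r"hash=(\d+)",
    "binSize": r"binSize=(\d+)",
    "addr1": r"addr1=(0x[0-9a-fA-F]+)",
    "addr2": r"addr2=(0x[0-9a-fA-F]+)",
    "tslot": r"tslot:(\d+)",
}

def fault_signature(plog_text: str) -> dict:
    """Extract the first occurrence of each signature field (None if absent)."""
    sig = {}
    for name, pattern in FIELDS.items():
        m = re.search(pattern, plog_text)
        sig[name] = m.group(1) if m else None
    return sig

def signatures_identical(plogs: list) -> bool:
    """True when every plog text yields the same signature tuple."""
    sigs = {tuple(fault_signature(p)[k] for k in FIELDS) for p in plogs}
    return len(sigs) <= 1

# Abbreviated stand-ins for two runs (baseline and one mitigation).
run_a = ("addr1=0x124000000090, addr2=0x12400000076c, tslot:6, "
         "hash=15033215677169261682. binSize=140920.")
run_b = ("tslot:6, addr1=0x124000000090, addr2=0x12400000076c, "
         "binSize=140920. hash=15033215677169261682.")

print(fault_signature(run_a))
print(signatures_identical([run_a, run_b]))
```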

What we tried (all leave the failure unchanged)

  1. Phase A split-scope refactor of full_fa_fused into 4 sequential
    spmds (full_qk_matmul AIC → full_softmax AIV → full_sv_matmul AIC →
    full_online_softmax AIV), mirroring qwen3/32b's pattern. Eliminates the
    mixed AIC+AIV dispatch entirely. chip_orch.cpp verified — no remaining
    MixedKernels groups. → same hash, same tslot:6.
  2. Split full_out_proj into pure-cube matmul + pure-vec cast via FP32
    GM scratch handoff. → same hash.
  3. Split dense_gate_up_silu_tp and dense_down_proj_tp the same way.
    → same hash.
  4. SWA mirror of (1)+(2) in attention_swa.py. → same hash.
  5. Full-row cast + overwrite RoPE idiom for full_rope_kv_cache K and Q
    writes (qwen3/32b uses this; replaces a pl.add(k_pass, 0.0) workaround
    that was previously masking a different compile-time error). → same hash.
  6. Rewrite full_rmsnorm_zc from pl.spmd(BATCH//BATCH_TILE=1) + pl.range to
    qwen3/32b's pl.at(level=pl.Level.CORE_GROUP) + pl.pipeline(stage=4) form.
    → same hash.
  7. pl.parallel(user_batch) → pl.parallel(BATCH) (dynamic loop bound
    replaced with static Python constant). Tested with defensive
    b_safe = pl.min(b, user_batch-1) clamp. → same hash.
  8. TP=1 monkey-patch path (--tp-world-size 1) which takes a different
    code path in step3p5_decode.py:351-381. → same hash.

All eight changes are compile-clean (smoke probe rc=0) and structurally
match the working qwen3/32b form in the same repo. None move the fault.

Rule-out matrix

| Suspect | Outcome | Evidence |
|---|---|---|
| Mixed AIC+AIV dispatch | Ruled out | All MixedKernels removed via (1)–(4); chip_orch.cpp has only rt_submit_aic_task / rt_submit_aiv_task |
| pypto#1693 / PR#1718 (multi-output spmd SSA aliasing) | Ruled out | PR #1718 merged on pod (pypto:0f4881cb); no effect on this fault |
| CANN version | Ruled out | Reproduces on 8.5.1 and 9.0.0-beta.1 |
| Python version | Ruled out | Reproduces on 3.10 and 3.11 |
| PTOAS version | Ruled out | Reproduces on v0.43 and v0.44 |
| SDMA workspace (aclnnShmemSdmaStarsQuery) AICPU 0x2a | Ruled out | SIMPLER_ENABLE_PTO_SDMA_WORKSPACE=OFF already in effect; nm -D libhost_runtime.so returns zero SdmaWorkspaceManager symbols; AICPU 0x2a in the log is a cascade ~1900 ms after the AICore 0x800 |
| TP=8 canonical vs TP=1 monkey-patch | Ruled out | Same hash on both code paths |
| Driver support_shmem_map_exbus=0 | Ruled out for single-rank | This driver flag affects multi-rank aclrtIpcMemImportByKey (filed separately); single-rank reproducer here uses no cross-card IPC |
| Dummy input out-of-bounds (pos = ctx_len-1 underflow when seq_lens=0) | Ruled out | Current step3p5_decode.py:552 sets seq_lens = torch.ones(...) so pos = 0 is always in-bounds; slot_mapping = torch.arange(B) gives unique slots |
| pl.parallel(dynamic) vs pl.parallel(static) loop bound | Ruled out | (7) above |

What we believe but cannot verify locally

The faulting VEC instruction is inside a model-kernel dispatched by
simpler's executor at the 7th FFTS+ MIX SQE slot. Strong candidate based on
the dispatch sequence we read out of chip_orch.cpp:

  • full_rope_kv_cache (per-batch loop, AIV) — its position in dispatch
    order matches tslot:6 directly.

The TileType::Vec declarations we read out of the generated
full_rope_kv_cache.cpp all use 32-byte-aligned widths (float[1,32],
bfloat16[1,32], float[8,32], etc.). We did not find a structural
alignment violation by inspection. The pattern that differs from qwen3/32b
and is unique to step3p5 is partial RoPE (ROTARY_HALF_FULL=32,
rotary_dim=64, pass-through=64) versus full RoPE (rotary_dim=128). Whether
PTOAS lowers the partial-RoPE pattern into an unaligned VEC is what we'd
like the codegen team to confirm.
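
The by-inspection check on tile widths amounts to the arithmetic below. This is only a sketch of the static check: the 32-byte granularity is taken from the paragraph above, the element sizes are the standard 4 bytes for float and 2 bytes for bfloat16, and the helper names are hypothetical. It says nothing about runtime UB offsets, which is where a partial-RoPE lowering could still go wrong.

```python
# Element sizes in bytes (standard widths for these dtypes).
DTYPE_BYTES = {"float": 4, "bfloat16": 2}
ALIGN = 32  # alignment granularity quoted in this report

def row_bytes(dtype: str, shape: tuple) -> int:
    """Byte width of the innermost (contiguous) row of a tile."""
    return shape[-1] * DTYPE_BYTES[dtype]

def is_row_aligned(dtype: str, shape: tuple, align: int = ALIGN) -> bool:
    """True when the innermost row width is a multiple of `align` bytes."""
    return row_bytes(dtype, shape) % align == 0

# Tile declarations named in the paragraph above.
tiles = [("float", (1, 32)), ("bfloat16", (1, 32)), ("float", (8, 32))]

for dtype, shape in tiles:
    print(dtype, shape, row_bytes(dtype, shape), is_row_aligned(dtype, shape))

# A hypothetical narrower tile for contrast: 8 bfloat16 elements = 16 bytes.
print(is_row_aligned("bfloat16", (1, 8)))
```

All three declared tiles pass this static test, which is consistent with the statement that no structural violation was found by inspection.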

What we are asking for

  1. Disassemble aicore_kernel.o (140920 bytes, source
    simpler/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
    + simpler/src/a2a3/platform/onboard/aicore/kernel.cpp) and the model
    kernels generated by PTOAS for the step3p5 reproducer at
    next_levels/chip_orch/kernels/aiv/full_rope_kv_cache.cpp. Find the VEC
    instruction at PC offset ~0x80c00d9d0c from addr1 (pc current:
    0x12c0c00d9d9c).
  2. Confirm whether the offending VEC originates in (a) the simpler
    executor, (b) a PTOAS-injected prologue/epilogue, or (c) the dispatched
    model kernel body.
  3. If (c), tell us which IR pattern lowered to it — we will then either
    refactor step3p5 to avoid that pattern or push a fix into PTOAS.
  4. If (a) or (b), the fix belongs upstream.

Related local artifacts (available on request)

  • Full plog: /tmp/p15_devlog3/debug/plog/plog-*.log (multiple runs across
    this session)
  • Compiled chip orchestrator: /tmp/p15_npu_d0/DecodeLayerDense_*/next_levels/chip_orch/
  • Working tree diffs (this session's eight Phase A refactors): under
    workspace/pypto-lib/models/step3p5/{attention_full,attention_swa,decode_layer}.py
  • Session diagnosis log: docs/step3p5/phases/15-singlerank-npu.md — read
    "Phase A execution status (2026-06-11, end of session)" and
    "Simpler-runtime kernel identification (2026-06-11, follow-up)"

See also

  • docs/upstream-issues/pypto-1702-followup.md — earlier hypothesis (now
    ruled out) that PR#1718 should fix this fault; filed as pypto#1738 on
    2026-06-10
  • docs/upstream-issues/simpler-comm-init-segfault.md — separate
    comm_init segfault, fixed via --no-as-needed link patch
    (simpler#1018)
  • docs/upstream-issues/step3p5-multirank-shmem-exbus.md — Phase 16 driver
    capability gap (filed jointly with this issue)

Summary of latest status (2026-06-15)

One-sentence summary

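step3p5 single-card decode bring-up is blocked on 507018, caused by AICore
errcode 0x800 "VEC UB not aligned". The latest bisect (the
P15_DISPATCH_LIMIT ladder) pins the fault to chip_orch task 11 =
full_head_gate (AIV), not full_rope_kv_cache as originally believed. On the
same chip and the same CANN, qwen3/32b decode runs end-to-end in 20 seconds,
which shows that the chip / driver / runtime are healthy; the problem is
that step3p5 triggers an unaligned VEC instruction in one specific kernel,
full_head_gate.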

Reproducer

# in a venv with pypto + simpler + pto-isa + ptoas installed
cd <pypto-lib>
python -m models.step3p5.step3p5_decode -p a2a3 -d 0 --no-smoke --dummy-weights
# Expected: the chip crashes ~22 ms after the first kernel dispatch
# host: aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
# plog: errcode 0x800 errorStr: "The UB address accessed by the VEC instruction is not aligned"
#       fault kernel_name=aicore_kernel_0_mix_aic, hash=15033215677169261682, binSize=140920

binSize=140920 bytes is exactly the size of the simpler runtime's own
polling-dispatch executor,
simpler/build/lib/a2a3/onboard/tensormap_and_ringbuffer/aicore_kernel.o,
so the "fault kernel" is simpler's dispatch trampoline; the VEC instruction
that actually faults is in the dispatched full_head_gate body.

Counter-example: qwen3/32b passes in the same environment

python -m models.qwen3.32b.qwen3_32b_decode -p a2a3 -d 0
# [RUN] PASS (20.15s)   'out' PASS  shape=(16, 8192) dtype=torch.bfloat16

Same chip / CANN / simpler / pto-isa / PTOAS / Python. The chip and runtime
are fine; the fault is triggered by an IR pattern specific to step3p5.

Localisation: dispatch-limit bisect

Commenting out every rt_submit_*_task(K, ...) call later than a given K in
the generated chip_orch.cpp and then rebuilding the .so shows that LIMIT=11
is the exact PASS↔FAIL boundary. It corresponds to a single added call,
rt_submit_aiv_task(11, params_t11) = full_head_gate (per-rank head-wise
sigmoid gate; source at
pypto-lib/models/step3p5/attention_full.py:564-586, outer
pl.spmd(BATCH // BATCH_TILE) + inner pl.range(NUM_HEADS_FULL_LOCAL)).

Trace harness file: pypto-lib/tools/p15_trace/run_with_trace.py. The
P15_DISPATCH_LIMIT environment variable sets the dispatch limit. The harness
sits at the compile_single_orchestration hook, so the simpler runtime does
not need to be rebuilt.

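Rule-out paths (time-consuming but conclusive)

| Suspect | Status | Evidence |
|---|---|---|
| Mixed AIC+AIV dispatch | Ruled out | Phase A split into 4 fa spmds + out_proj cube/cast split + dense MLP split; chip_orch.cpp has no remaining MixedKernels, hash unchanged |
| pypto#1693 / PR#1718 (multi-output spmd SSA aliasing) | Ruled out | PR#1718 already ff-merged into pypto main; the fault does not move |
| CANN 8.5.1 vs 9.0.0-beta.1 | Ruled out | Crashes on both CANN versions; libhccl.so load verified live |
| Python 3.10 vs 3.11 / PTOAS v0.43 vs v0.44 | Ruled out | All combinations crash |
| SDMA workspace AICPU 0x2a cascade | Ruled out | nm -D libhost_runtime.so shows no SdmaWorkspaceManager symbols; AICPU 0x2a appears in plog 1900 ms after AICore 0x800, so it is a cascade, not a cause |
| pos = ctx_len-1 underflow on dummy seq_lens=0 | Ruled out | step3p5_decode.py:552 sets seq_lens=ones → pos=0 is always in-bounds |
| pl.parallel(dynamic) vs pl.parallel(static) | Ruled out | Changed to static pl.parallel(BATCH=16) + b_safe clamp; hash unchanged |
| full_rope_kv_cache kernel body | Ruled out | Byte-wise diff of the rope kernel body against a PASS reference shows no difference, so the bug is not in the rope kernel itself |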

What we would like upstream to do

  1. Obtain a disassembly of aicore_kernel_0_mix_aic (140920 bytes, source
    simpler/src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
    + simpler/src/a2a3/platform/onboard/aicore/kernel.cpp) and of
    next_levels/chip_orch/kernels/aiv/full_head_gate.cpp as compiled by
    PTOAS for the step3p5 reproducer. Use the plog pc current to find the
    unaligned VEC instruction.
  2. Tell us where the fault originates: (a) the simpler executor itself,
    (b) a PTOAS-injected prologue/epilogue, or (c) inside the
    full_head_gate body.
  3. If (c), tell us which IR pattern triggers it, and we will change the
    step3p5 model code to avoid it.
  4. If (a) or (b), the fix belongs upstream.

Reference artifacts (available on request)

  • Full plog (multiple failing runs): /tmp/p15_devlog3/debug/plog/plog-*.log
  • Compiled artifacts: /tmp/p15_npu_d0/DecodeLayerDense_*/next_levels/chip_orch/
  • Working-tree code (including the 8 Phase A split attempts):
    pypto-lib/models/step3p5/{attention_full,attention_swa,decode_layer,prefill_attention_full,prefill_attention_swa}.py
    (pushed to csy0225/pypto-lib:feat/step3p5-phase-a-split-scope;
    draft PR opened as hw-native-sys/pypto-lib#510)
  • Detailed local diagnosis log: docs/step3p5/phases/15-singlerank-npu.md;
    see in particular the sections
    "Phase A execution status (2026-06-11, end of session)" and
    "Simpler-runtime kernel identification (2026-06-11, follow-up)"
