Summary
scope_stats (scope-statistics profiling) drops its already-recorded records
when a run hangs (AICPU scheduler-hang / AICore op-timeout). When the AICPU is
reaped before it flushes its scope_stats buffer, the host's reconcile_counters
detects the un-flushed buffer ("device flush failed") but only logs it, so the
records are lost. Give scope_stats the same abnormal-path recovery that tensor
dump just got (#1034 / #1035), so a hung run still yields the scope records
collected up to the hang.
Motivation / Use Case
scope_stats is a diagnostic; like tensor dump, it is most valuable exactly when
a run misbehaves — yet today its already-recorded records are dropped on the
hang path:
- teardown_shared_collectors_after_run() exports scope_stats on the op-timeout
  return after #1035 ("Add: tensor dump on the abnormal path (hang /
  op-timeout) (a5 + a2a3)"); it runs stop() → reconcile_counters() →
  write_jsonl().
- Host-side recovery of an un-flushed device buffer is NOT implemented.
  scope_stats's reconcile (scope_stats_collector.cpp, the "un-flushed buffer
  ... device flush failed" path) has the same detection-only behaviour that
  tensor dump's reconcile had before #1035: it only logs. When the AICPU is
  reaped before flushing, or when orchestration hangs before the
  orchestrator-phase flush (scope_stats_aicpu_flush_buffers() in
  aicpu_executor.cpp), those records never reach disk.
Proposed API / Behavior
Mirror the tensor-dump fix (#1035) for scope_stats; no user-facing API change:
- In scope_stats reconcile_counters, when current_buf_ptr != 0 with a
  non-empty buffer, recover those records directly from the device buffer
  (before the device force-reset) into the collected set instead of just
  logging, so write_jsonl exports them.
- (Optional) Also flush the scope_stats partial buffer on the
  scheduler-timeout exit path for an AICPU-stall hang, so the device delivers
  cleanly when it still can.
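
The host-side recovery step could look roughly like the sketch below. It is an illustration only, not the collector's actual code: ScopeRecord, DeviceBufferState, DeviceRead and recover_unflushed are hypothetical names, and the record layout, the capacity field and the device-read callback are assumptions. Only current_buf_ptr and the "recover instead of just logging" behaviour come from this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical record layout; the real scope_stats record is not shown here.
struct ScopeRecord {
    uint32_t scope_id;
    uint64_t start_ns;
    uint64_t end_ns;
};

// Hypothetical host-visible bookkeeping for one device-side buffer.
struct DeviceBufferState {
    uint64_t current_buf_ptr;  // 0 means the device already flushed the buffer
    uint32_t record_count;     // records written before the hang
    uint32_t capacity;         // maximum records the buffer can hold
};

// Hypothetical device-to-host copy; stands in for whatever the runtime uses.
using DeviceRead =
    std::function<bool(uint64_t dev_ptr, void* dst, std::size_t bytes)>;

// Copies un-flushed records into `collected` and returns how many were
// recovered. Must run before the device force-reset invalidates the buffer.
std::size_t recover_unflushed(const DeviceBufferState& st,
                              const DeviceRead& read_device,
                              std::vector<ScopeRecord>& collected) {
    assert(read_device && "caller must supply a device read function");

    // Nothing to recover: buffer was flushed, or it is empty.
    if (st.current_buf_ptr == 0 || st.record_count == 0) {
        return 0;
    }

    // After a hang the count may be corrupt, so never read past capacity.
    const uint32_t n =
        st.record_count < st.capacity ? st.record_count : st.capacity;

    // Stage into a temporary so a failed read leaves `collected` untouched.
    std::vector<ScopeRecord> staged(n);
    if (!read_device(st.current_buf_ptr, staged.data(),
                     staged.size() * sizeof(ScopeRecord))) {
        return 0;  // fall back to the current detect-and-log behaviour
    }

    collected.insert(collected.end(), staged.begin(), staged.end());
    return staged.size();
}
```

Staging the copy first means a failed device read degrades to the current log-only behaviour rather than exporting partial records. Clamping to the capacity guards against a record count that was corrupted when the run hung.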
Additional Context
Related: #1034 / #1035 (tensor-dump abnormal-path dump + host-side recovery);
this issue applies the same pattern to the scope_stats collector. #996 is a
separate scope_stats accounting bug and is not covered here.