[Feature] Recover scope_stats records on the abnormal path (hang / op-timeout) #1038

@ChaoZheng109

Summary

scope_stats (scope-statistics profiling) drops its already-recorded records
when a run hangs (AICPU scheduler-hang / AICore op-timeout). When the AICPU is
reaped before it flushes its scope_stats buffer, the host's
reconcile_counters detects the un-flushed buffer ("device flush failed") but
only logs it — the records are lost. Give scope_stats the same
abnormal-path recovery that tensor dump just got (#1034 / #1035), so a hung run
still yields the scope records collected up to the hang.

Motivation / Use Case

scope_stats is a diagnostic; like tensor dump, it is most valuable exactly when
a run misbehaves — yet today its already-recorded records are dropped on the
hang path:

  • Host export on the error path is already covered.
    Since #1035, teardown_shared_collectors_after_run() exports scope_stats on
    the op-timeout return (it runs stop() -> reconcile_counters() ->
    write_jsonl()).
  • Host-side recovery of an un-flushed device buffer is NOT.
    scope_stats's reconcile (scope_stats_collector.cpp, the
    "un-flushed buffer ... device flush failed" path) has the same
    detection-only behaviour that tensor dump's reconcile had before #1035: it
    only logs. When the AICPU is reaped before flushing, or when orchestration
    hangs before the orchestrator-phase flush
    (scope_stats_aicpu_flush_buffers() in aicpu_executor.cpp), those records
    never reach disk.

Proposed API / Behavior

Mirror the tensor-dump fix (#1035) for scope_stats; no user-facing API change:

  1. In scope_stats reconcile_counters, when current_buf_ptr != 0 with a
    non-empty buffer, recover those records directly from the device buffer
    (before the device force-reset) into the collected set instead of just
    logging, so write_jsonl exports them.
  2. (Optional) Also flush the scope_stats partial buffer on the
    scheduler-timeout exit path for an AICPU-stall hang, so the device delivers
    cleanly when it still can.
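
A minimal sketch of what item 1 could look like on the host side. Everything here except current_buf_ptr is an assumption made for illustration: the ScopeRecord layout, the DeviceBufferView struct, its count / capacity / records fields, and the recover_unflushed() helper are hypothetical names, not the actual scope_stats_collector.cpp types. It assumes the host has already copied the device buffer into host memory before the force-reset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical record layout; the real scope_stats record is not shown in this issue.
struct ScopeRecord {
    std::uint32_t scope_id;
    std::uint64_t begin_cycle;
    std::uint64_t end_cycle;
};

// Hypothetical host-side view of one device buffer, read before the force-reset.
struct DeviceBufferView {
    std::uint64_t current_buf_ptr;  // non-zero: the device never handed this buffer back
    std::uint32_t count;            // number of records the device claims it wrote
    std::uint32_t capacity;         // maximum number of records the buffer can hold
    const ScopeRecord* records;     // host copy of the buffer contents
};

// Instead of only logging "device flush failed", append the un-flushed records
// to the collected set so a later export step can write them out.
// Returns the number of records recovered.
std::size_t recover_unflushed(const DeviceBufferView& buf,
                              std::vector<ScopeRecord>& collected) {
    if (buf.current_buf_ptr == 0 || buf.count == 0 || buf.records == nullptr) {
        return 0;  // buffer was flushed normally, or there is nothing to recover
    }
    // A hung device may leave a torn counter, so clamp it to the capacity.
    const std::size_t n = std::min<std::size_t>(buf.count, buf.capacity);
    assert(n <= buf.capacity);
    collected.insert(collected.end(), buf.records, buf.records + n);
    std::fprintf(stderr,
                 "scope_stats: recovered %zu record(s) from un-flushed buffer\n", n);
    return n;
}
```

Clamping the count to the capacity is a defensive choice in this sketch: on the hang path the device-side counter cannot be assumed consistent, and recovering a bounded prefix seems preferable to reading past the buffer.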

Additional Context

Related: #1034 / #1035 — tensor-dump abnormal-path dump + host-side recovery;
this issue applies the same pattern to the scope_stats collector. #996 is a
separate scope_stats accounting bug, not this.
