fix(atelet): set OCI CgroupsPath so each actor gets its own cgroup #161

Open
ArnaudBger wants to merge 1 commit into agent-substrate:main from ArnaudBger:fix/atelet-actor-cgroups-path

ArnaudBger commented Jun 3, 2026

Fixes #50

Issue

When the OCI spec leaves Linux.CgroupsPath empty, runsc uses whatever cgroup it inherited from its parent process and does not record it as one it owns. At teardown, runsc only attempts Rmdir on paths it owns:

// runsc/cgroup/cgroup_v2.go — Uninstall
for i := len(c.Own) - 1; i >= 0; i-- {
    current := c.Own[i]
    // unix.Rmdir(current) with backoff on EBUSY
}

That makes the failure mode asymmetric, which is why it is hard to spot:

  • Non-owner actors hit Uninstall with an empty c.Own, so the loop body never runs and the function returns nil: no log, no error.
  • The owner actor actually calls Rmdir on the shared cgroup, and gets EBUSY because the non-owners' processes are still in cgroup.procs. This is the only log line you ever see.

From the logs alone, it looks like one specific actor is broken, when in reality the bug is shared-cgroup ownership across actors.
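
The asymmetry can be reproduced in a toy model. This is a minimal sketch, not runsc code: `fakeCgroupFS`, `actor`, `uninstall`, and `errBusy` are hypothetical names, and the only kernel behavior modeled is that a cgroup with processes still in cgroup.procs cannot be removed.

```go
package main

import (
	"errors"
	"fmt"
)

// errBusy stands in for unix.EBUSY in this toy model.
var errBusy = errors.New("device or resource busy")

// fakeCgroupFS maps a cgroup path to the number of processes still in it.
type fakeCgroupFS map[string]int

// rmdir mimics the rule that a cgroup with live processes cannot be removed.
func (fs fakeCgroupFS) rmdir(p string) error {
	if fs[p] > 0 {
		return fmt.Errorf("removing cgroup path %q: %w", p, errBusy)
	}
	delete(fs, p)
	return nil
}

// actor models only the state that matters here: the paths this
// container created and therefore owns.
type actor struct {
	name string
	own  []string
}

// uninstall walks the owned paths in reverse, like the loop quoted above.
func (a actor) uninstall(fs fakeCgroupFS) error {
	for i := len(a.own) - 1; i >= 0; i-- {
		if err := fs.rmdir(a.own[i]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	const shared = "/sys/fs/cgroup/pause"
	fs := fakeCgroupFS{shared: 1} // B's process is still in the shared cgroup

	a := actor{name: "A", own: []string{shared}} // A created the cgroup
	b := actor{name: "B"}                        // B inherited it and owns nothing

	fmt.Println("B:", b.uninstall(fs)) // nothing to remove, so no error is reported
	fmt.Println("A:", a.uninstall(fs)) // the only teardown that can fail
}
```

In this model B's teardown always succeeds silently, while A's fails for as long as B's process remains, which matches the single log line described above.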

How to reproduce

  1. Start actor A with no Linux.CgroupsPath. runsc creates the pause cgroup; A owns it.
  2. Start actor B in the same sandbox; B attaches processes to A's pause cgroup.
  3. Suspend or delete A. runsc calls Rmdir on the pause path.
  4. Rmdir returns EBUSY (B is still in cgroup.procs), which surfaces as the error removing cgroup path "/sys/fs/cgroup/pause": device or resource busy and leaves stale state, matching the failure in "Actor stuck in STATUS_SUSPENDING" (#50).

Solution

Set a relative, per-actor path:

CgroupsPath: path.Join("actors", actorTemplateNamespace, actorTemplateName, actorID, containerName),

  • Relative → runsc resolves it under its own current cgroup (the sandbox pod's cgroup), so per-actor usage rolls up into the pod's kubelet/cAdvisor accounting.
  • Unique per actor → each container lives in its own directory, ends up in c.Own, and is created/cleaned up cleanly with no cross-actor EBUSY.
  • Layout follows the kubectl <namespace>/<name> convention (actors/<ns>/<template>/<actor>/<container>), so per-namespace and per-template totals are queryable at each level.
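
The three properties above can be checked with plain string operations. The following is a small sketch under assumptions: `actorCgroupsPath` is a hypothetical helper that wraps the same `path.Join` call, and the namespace, template, actor, and container values are made up for illustration.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// actorCgroupsPath is a hypothetical helper mirroring the path.Join call above.
func actorCgroupsPath(ns, template, actorID, container string) string {
	return path.Join("actors", ns, template, actorID, container)
}

func main() {
	a := actorCgroupsPath("team-a", "worker", "actor-1", "main")
	b := actorCgroupsPath("team-a", "worker", "actor-2", "main")

	fmt.Println(a)
	fmt.Println(b)

	// Relative: no leading slash, so the path is resolved under a parent cgroup.
	fmt.Println("relative:", !path.IsAbs(a))

	// Unique per actor: different actor IDs give different directories.
	fmt.Println("distinct:", a != b)

	// Hierarchical: both actors sit under the same template-level directory.
	templateDir := path.Dir(path.Dir(a)) // strips <container> and <actor>
	fmt.Println("shared template dir:", strings.HasPrefix(b, templateDir+"/"))
}
```

One caveat worth keeping in mind: `path.Join` drops empty elements and cleans `..` segments, so an empty or unusual component would silently change the depth of the resulting path. Validating the components before joining may be worthwhile.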

Spec construction is extracted into buildSpec so the new behavior is unit-testable.
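
As an illustration of why a pure spec constructor is easy to test, here is a sketch using local stand-in types. Everything in it is an assumption: `spec` and `linuxSpec` only imitate the shape of the OCI runtime-spec types, `actorRef` and `checkSpecs` are hypothetical, and the `buildSpec` in this PR may well have a different signature.

```go
package main

import (
	"fmt"
	"path"
)

// linuxSpec and spec are local stand-ins for the OCI runtime-spec types,
// reduced to the one field discussed here.
type linuxSpec struct{ CgroupsPath string }
type spec struct{ Linux *linuxSpec }

// actorRef is a hypothetical input type for this sketch.
type actorRef struct{ Namespace, Template, ID, Container string }

// buildSpec here is a sketch of a pure constructor: no I/O, only data in and out.
func buildSpec(a actorRef) spec {
	return spec{Linux: &linuxSpec{
		CgroupsPath: path.Join("actors", a.Namespace, a.Template, a.ID, a.Container),
	}}
}

// checkSpecs reports an error if any spec has an empty, absolute,
// or duplicated CgroupsPath.
func checkSpecs(specs []spec) error {
	seen := map[string]bool{}
	for _, s := range specs {
		p := ""
		if s.Linux != nil {
			p = s.Linux.CgroupsPath
		}
		switch {
		case p == "":
			return fmt.Errorf("empty CgroupsPath")
		case path.IsAbs(p):
			return fmt.Errorf("absolute CgroupsPath %q", p)
		case seen[p]:
			return fmt.Errorf("duplicate CgroupsPath %q", p)
		}
		seen[p] = true
	}
	return nil
}

func main() {
	specs := []spec{
		buildSpec(actorRef{"team-a", "worker", "actor-1", "main"}),
		buildSpec(actorRef{"team-a", "worker", "actor-2", "main"}),
	}
	fmt.Println("check:", checkSpecs(specs))
}
```

A check along these lines would catch a regression back to an empty or shared path without starting a sandbox.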


google-cla Bot commented Jun 3, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

ArnaudBger force-pushed the fix/atelet-actor-cgroups-path branch from e662fc3 to 7d3e5cd on June 3, 2026 06:44
ArnaudBger force-pushed the fix/atelet-actor-cgroups-path branch from 7d3e5cd to 3164926 on June 4, 2026 05:47
Development

Successfully merging this pull request may close these issues.

Actor stuck in STATUS_SUSPENDING