feat(autoscaler): deploy node-agent on nodes missing the grouping label#389
Conversation
The node-agent autoscaler grouped nodes by NodeGroupLabel (default node.kubernetes.io/instance-type) and skipped any node that did not carry it, so on on-prem/custom clusters no node-agent DaemonSet was created and the autoscaler effectively did nothing. Add a configurable DefaultNodeGroup (default "default", empty = strict skip). Label-less nodes now fall into this default group, flagged via NodeGroup.IsDefault and exposed to the template as IsDefaultGroup. Since a nodeSelector cannot match an absent label, the default group's DaemonSet is targeted with a "DoesNotExist" node affinity instead (the Helm chart provides the template branch). Sizing is unchanged (first node's allocatable per group). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughAdds a configurable default node group for unlabeled nodes, threads that state into DaemonSet rendering and reconciliation, and updates tests, docs, and CI checks for the new scheduling behavior. ChangesDefaultNodeGroup feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 OpenGrep (1.23.0)nodeagentautoscaler/autoscaler_test.go┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.15][ERROR]: unable to find a config; path nodeagentautoscaler/nodegrouper.go┌──────────────┐ �[32m✔�[39m �[1mOpengrep OSS�[0m [00.16][ERROR]: unable to find a config; path 🔧 markdownlint-cli2 (0.22.1)docs/node-agent-autoscaler.mdmarkdownlint-cli2 wrapper config was not available before execution Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
nodeagentautoscaler/integration_test.go (1)
110-111: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert the Helm default branch keeps
kubernetes.io/os: linux.The in-memory renderer test already checks this, but this integration test does not. If the real Helm template drops the OS selector in the default-group branch, this test still passes while the DaemonSet can target unsupported nodes.
Proposed assertion
assert.Equal(t, "node-agent-default", ds.Name) // Must not pin to an instance-type value (the nodes lack the label). + assert.Equal(t, "linux", ds.Spec.Template.Spec.NodeSelector["kubernetes.io/os"]) assert.NotContains(t, ds.Spec.Template.Spec.NodeSelector, "node.kubernetes.io/instance-type")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nodeagentautoscaler/integration_test.go` around lines 110 - 111, The integration test for the DaemonSet template currently only verifies it does not pin to an instance-type label, but it also needs to assert the default Helm branch preserves the OS selector. Update the checks in nodeagentautoscaler integration_test around the DaemonSet spec so it explicitly confirms ds.Spec.Template.Spec.NodeSelector still contains kubernetes.io/os with linux when using the default-group path, alongside the existing instance-type assertion.nodeagentautoscaler/autoscaler_test.go (1)
358-365: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert
IsDefaulton the fallback group.Right now this only proves the unlabeled nodes were counted under
"default". IfGetNodeGroupsstops settingNodeGroup.IsDefault, the default-affinity path can still stay green because the renderer tests constructNodeGroup{IsDefault: true}manually. Please assert the flag here as part of the grouper contract.Proposed test assertion
require.Contains(t, byLabel, "default") assert.Equal(t, 2, byLabel["default"].NodeCount) + assert.True(t, byLabel["default"].IsDefault) require.Contains(t, byLabel, "m5.large") assert.Equal(t, 1, byLabel["m5.large"].NodeCount) + assert.False(t, byLabel["m5.large"].IsDefault)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nodeagentautoscaler/autoscaler_test.go` around lines 358 - 365, The fallback group check in `TestGetNodeGroups` only verifies the `"default"` label and node count, so it can miss regressions where `GetNodeGroups` stops marking the group as default. Update the assertions around the `byLabel["default"]` entry to also verify its `IsDefault` field is true, using the existing `GetNodeGroups` result and `NodeGroup` struct contract as the locator.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nodeagentautoscaler/autoscaler_test.go`:
- Line 343: The test setup in the autoscaler test is using the deprecated fake
client constructor, so update the client initialization in the relevant test
helper to use fake.NewClientset(nodes...) instead of
fake.NewSimpleClientset(nodes...). Keep the rest of the test logic unchanged and
make sure the autoscaler test continues to build against the client-go version
in use.
In `@nodeagentautoscaler/nodegrouper.go`:
- Around line 82-114: The fallback handling in nodegrouper.go’s grouping logic
is merging unlabeled nodes with real nodes whose NodeGroupLabel matches
DefaultNodeGroup, which collapses distinct populations into one NodeGroup.
Update the grouping in the node-group aggregation path so default-fallback nodes
are tracked separately from actual label values, and only set IsDefault for the
fallback group created from missing labels. Use the existing symbols NodeGroup,
groupMap, labelValue, isDefault, and RenderDaemonSet to ensure labeled nodes
keep their normal affinity path while fallback nodes continue using the
DoesNotExist branch.
---
Nitpick comments:
In `@nodeagentautoscaler/autoscaler_test.go`:
- Around line 358-365: The fallback group check in `TestGetNodeGroups` only
verifies the `"default"` label and node count, so it can miss regressions where
`GetNodeGroups` stops marking the group as default. Update the assertions around
the `byLabel["default"]` entry to also verify its `IsDefault` field is true,
using the existing `GetNodeGroups` result and `NodeGroup` struct contract as the
locator.
In `@nodeagentautoscaler/integration_test.go`:
- Around line 110-111: The integration test for the DaemonSet template currently
only verifies it does not pin to an instance-type label, but it also needs to
assert the default Helm branch preserves the OS selector. Update the checks in
nodeagentautoscaler integration_test around the DaemonSet spec so it explicitly
confirms ds.Spec.Template.Spec.NodeSelector still contains kubernetes.io/os with
linux when using the default-group path, alongside the existing instance-type
assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a0c66bf3-643d-4849-918b-6a67eb1bb12c
📒 Files selected for processing (8)
config/config.goconfig/config_test.godocs/node-agent-autoscaler.mdnodeagentautoscaler/autoscaler_test.gonodeagentautoscaler/integration_test.gonodeagentautoscaler/nodegrouper.gonodeagentautoscaler/templaterenderer.gonodeagentautoscaler/templaterenderer_test.go
matthyx
left a comment
There was a problem hiding this comment.
Review — blockers
Nice feature and the test coverage is solid. Two issues below look like real blockers before merge, plus one CI-coverage concern.
🔴 Blocker 1 — the default-group key collides with a real label value
nodeagentautoscaler/nodegrouper.go (~L81–115). Label-less nodes are bucketed under groupMap[ng.config.DefaultNodeGroup] ("default" by default). But that is the same keyspace as real label values, so a node that legitimately carries node.kubernetes.io/instance-type=default lands in the same group, and:
group.IsDefault = group.IsDefault || isDefaultflips the merged group onto the affinity path. The rendered DaemonSet then uses the DoesNotExist affinity, which by definition does not match the legitimately-labeled default node — so that node silently loses its node-agent. The two populations also share resource sizing (first node's allocatable per group), which can mis-size the DaemonSet.
The inline comment // A label value can be shared by labelled and label-less nodes shows the collision was noticed, but OR-ing IsDefault is the wrong resolution — it sacrifices the labeled node. The fallback population should be tracked in a separate bucket (e.g. a sentinel key that can't be a real label value) so a real default label and the synthetic fallback never merge. This overlaps with CodeRabbit's actionable comment on the same lines.
🔴 Blocker 2 — the DoesNotExist affinity needs the label key, but TemplateData only carries the value
nodeagentautoscaler/templaterenderer.go L24–29, L220. TemplateData.NodeGroupLabel is set to group.LabelValue (the value, e.g. m5.large / default), and no field carries the configured nodeGroupLabel key. So the companion Helm template can only hardcode node.kubernetes.io/instance-type in the DoesNotExist expression — exactly what the unit/integration tests do.
nodeGroupLabel is configurable (config/config.go), so if an operator overrides it, the default-group affinity still keys off node.kubernetes.io/instance-type. It then targets the wrong/absent label and the default group matches nothing (or the wrong nodes) — defeating the whole purpose of the feature for the exact custom/on-prem clusters it's meant to serve. Suggest passing the label key through TemplateData so the affinity is built from the configured key rather than a hardcoded constant.
🟡 Concern — the integration test silently skips, so the guarantee isn't enforced in CI
nodeagentautoscaler/integration_test.go (~L82–89). Both TestIntegration_HelmGeneratedTemplate* t.Skip when /tmp/test-daemonset-template.yaml is absent. In CI that file won't exist, so the "end-to-end against the actual Helm-generated template" guarantee in the PR description isn't actually verified — the default-group affinity (and the hardcoded-key issue above) could regress with everything green. Consider checking in a representative rendered template, or generating it in the CI job, so this test really runs.
Minor (non-blocking)
fake.NewSimpleClientset(...)is deprecated;fake.NewClientset(...)is the current constructor (per CodeRabbit).- CodeRabbit's nitpicks to assert
IsDefaultinTestNodeGrouper_GetNodeGroups_DefaultNodeGroupand to assertkubernetes.io/os: linuxin the integration default-group path are worth taking — they harden the grouper/affinity contract.
|
Summary:
|
|
🔗 Companion PR: kubescape/helm-charts#868 — adds the DaemonSet template branch ( |
Resolve the blockers and nitpicks from #389 review: - Track label-less nodes under an internal sentinel bucket so a node legitimately labelled "<groupingLabel>=default" forms its own normal (nodeSelector-targeted) group and is never merged onto the affinity path. Salt the fallback group's collision hash so it cannot collapse onto a real same-named group. - Key DaemonSet reconciliation (existence + orphan deletion) off the unique DaemonSet name instead of the node-group label value, which is not unique once the default group shares a value with a real group. - Pass the configured grouping label key through TemplateData (NodeGroupLabelKey) so the nodeSelector and the default group's DoesNotExist affinity key off the operator's actual grouping label rather than a hardcoded constant; the Helm template consumes it as the single source. - Run the integration tests in CI: extract the real DaemonSet template from the helm-charts chart so the tests run instead of skipping. - Tests: assert IsDefault on the grouper contract, assert os=linux survives in the default branch, add the =default collision test and a Reconcile create/orphan test; switch to fake.NewClientset. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
|
Thanks for the review @matthyx — addressed in 🔴 Blocker 1 — default-group key collides with a real label value 🔴 Blocker 2 — affinity needs the label key, but TemplateData only had the value 🟡 Concern — integration tests silently skip in CI Minor: switched to |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nodeagentautoscaler/autoscaler.go (1)
193-236: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winEnforce the rendered-name contract before using name-based bookkeeping.
Reconcile()now treatsGenerateDaemonSetName(group)as the resource identity, but the DaemonSet name still comes from a configurable template. If a custom template renders any othermetadata.name, this code will reconcile one name and orphan-delete another on the next loop. Please either validatedesiredDS.Name == dsNameand fail fast, or switch the existence/orphan tracking to the rendered name instead of the precomputed one.Suggested guard
desiredDS, err := a.templateRenderer.RenderDaemonSet(group, resources) if err != nil { logger.L().Error("failed to render DaemonSet", helpers.String("nodeGroup", group.LabelValue), helpers.Error(err)) continue } + if desiredDS.Name != dsName { + return fmt.Errorf("rendered DaemonSet name %q does not match expected %q", desiredDS.Name, dsName) + }Also applies to: 254-255
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nodeagentautoscaler/autoscaler.go` around lines 193 - 236, Reconcile() currently uses GenerateDaemonSetName(group) for processed/existing bookkeeping while RenderDaemonSet can produce a different metadata.name, so name-based tracking may target the wrong DaemonSet. In the Reconcile loop, either validate that desiredDS.Name matches dsName and stop with an error if it does not, or change the create/update/orphan cleanup logic to consistently use the rendered name from desiredDS.Name throughout the DaemonSet handling flow. Ensure the same identity is used wherever existingByName and processedNames are consulted.
🧹 Nitpick comments (1)
nodeagentautoscaler/autoscaler_test.go (1)
629-694: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a reconcile test for the real-
defaultcollision path.
TestNodeGrouper_GetNodeGroups_DefaultValueCollisionproves the groups stay distinct, but the regression fixed here was inReconcile()'s name-based existence/orphan bookkeeping. A test with one node labeled...=defaultplus unlabeled nodes would catch a future change that collapses one of the two DaemonSets during update/delete.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nodeagentautoscaler/autoscaler_test.go` around lines 629 - 694, Add a reconcile coverage for the real “default” collision path by extending the autoscaler reconcile test around NewAutoscaler and Reconcile. Use one node explicitly labeled with the node-group label set to default plus at least one unlabeled node so both the real default group and the fallback default group exist at once, then assert Reconcile creates and preserves two distinct DaemonSets instead of collapsing one during name-based existence/orphan handling. Keep the check focused on the unique DaemonSet names and the managed orphan cleanup logic in Reconcile.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/pr-created.yaml:
- Around line 41-55: The workflow job is checking out kubescape/helm-charts at
main, which can break this validation when the required Helm template change
only exists in the companion branch. Update the checkout in the pr-created
workflow to use the companion branch or pinned SHA for the stack, or add gating
so the integration step in the Extract node-agent daemonset template / Build and
run integration tests flow only runs when that ref is available.
- Around line 38-43: Disable credential persistence on the extra checkout step
in the Helm charts fetch. Update the actions/checkout usage for the Checkout
helm-charts step to turn off credential persistence, since this repository clone
is read-only. Keep the existing checkout configuration for kubescape/helm-charts
in place, but add the checkout setting that prevents the workflow token from
being written into helm-charts/.git/config.
In `@nodeagentautoscaler/templaterenderer.go`:
- Around line 226-229: The DaemonSet labels in the node agent template currently
use the shared node group label, which can collide between the fallback default
group and a real default group. Update the DaemonSet selector and pod labels in
templaterenderer.go, especially where the template uses the group-derived name
fields, so each DaemonSet is uniquely identified by a discriminator like Name
instead of kubescape.io/node-group. Also add a test that covers the
duplicated-default scenario and verifies the selectors no longer collide.
---
Outside diff comments:
In `@nodeagentautoscaler/autoscaler.go`:
- Around line 193-236: Reconcile() currently uses GenerateDaemonSetName(group)
for processed/existing bookkeeping while RenderDaemonSet can produce a different
metadata.name, so name-based tracking may target the wrong DaemonSet. In the
Reconcile loop, either validate that desiredDS.Name matches dsName and stop with
an error if it does not, or change the create/update/orphan cleanup logic to
consistently use the rendered name from desiredDS.Name throughout the DaemonSet
handling flow. Ensure the same identity is used wherever existingByName and
processedNames are consulted.
---
Nitpick comments:
In `@nodeagentautoscaler/autoscaler_test.go`:
- Around line 629-694: Add a reconcile coverage for the real “default” collision
path by extending the autoscaler reconcile test around NewAutoscaler and
Reconcile. Use one node explicitly labeled with the node-group label set to
default plus at least one unlabeled node so both the real default group and the
fallback default group exist at once, then assert Reconcile creates and
preserves two distinct DaemonSets instead of collapsing one during name-based
existence/orphan handling. Keep the check focused on the unique DaemonSet names
and the managed orphan cleanup logic in Reconcile.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b616c6a2-25a9-4892-9304-8f87e11cd13f
📒 Files selected for processing (8)
.github/workflows/pr-created.yamldocs/node-agent-autoscaler.mdnodeagentautoscaler/autoscaler.gonodeagentautoscaler/autoscaler_test.gonodeagentautoscaler/integration_test.gonodeagentautoscaler/nodegrouper.gonodeagentautoscaler/templaterenderer.gonodeagentautoscaler/templaterenderer_test.go
✅ Files skipped from review due to trivial changes (1)
- docs/node-agent-autoscaler.md
🚧 Files skipped from review as they are similar to previous changes (1)
- nodeagentautoscaler/integration_test.go
| - name: Checkout helm-charts | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| repository: kubescape/helm-charts | ||
| ref: main | ||
| path: helm-charts |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Disable credential persistence on the extra checkout.
This checkout is read-only, so persisting the workflow token into helm-charts/.git/config is unnecessary and creates an avoidable leak path if that workspace is ever uploaded or reused.
Suggested hardening
- name: Checkout helm-charts
uses: actions/checkout@v4
with:
repository: kubescape/helm-charts
ref: main
path: helm-charts
+ persist-credentials: false📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - name: Checkout helm-charts | |
| uses: actions/checkout@v4 | |
| with: | |
| repository: kubescape/helm-charts | |
| ref: main | |
| path: helm-charts | |
| - name: Checkout helm-charts | |
| uses: actions/checkout@v4 | |
| with: | |
| repository: kubescape/helm-charts | |
| ref: main | |
| path: helm-charts | |
| persist-credentials: false |
🧰 Tools
🪛 zizmor (1.26.1)
[warning] 38-43: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false
(artipacked)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/pr-created.yaml around lines 38 - 43, Disable credential
persistence on the extra checkout step in the Helm charts fetch. Update the
actions/checkout usage for the Checkout helm-charts step to turn off credential
persistence, since this repository clone is read-only. Keep the existing
checkout configuration for kubescape/helm-charts in place, but add the checkout
setting that prevents the workflow token from being written into
helm-charts/.git/config.
Source: Linters/SAST tools
| repository: kubescape/helm-charts | ||
| ref: main | ||
| path: helm-charts | ||
| - uses: azure/setup-helm@v4 | ||
| - name: Extract node-agent daemonset template | ||
| run: | | ||
| helm template test helm-charts/charts/kubescape-operator \ | ||
| --kube-version 1.29.0 \ | ||
| --set nodeAgent.autoscaler.enabled=true --set clusterName=test \ | ||
| | grep -A 400 "daemonset-template.yaml:" | tail -n +2 | sed 's/^ //' \ | ||
| | awk '/^---/{exit} {print}' > /tmp/test-daemonset-template.yaml | ||
| test -s /tmp/test-daemonset-template.yaml || { echo "template extraction produced an empty file"; exit 1; } | ||
| - name: Build and run integration tests | ||
| working-directory: operator | ||
| run: go test -tags=integration ./... |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Pin the chart checkout to the companion change instead of main.
This job now validates behavior that depends on the companion Helm template branch, so checking out kubescape/helm-charts@main makes the PR fail until that other PR merges. Point this at the companion branch/SHA for the stack, or gate the integration job until that ref is available.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/pr-created.yaml around lines 41 - 55, The workflow job is
checking out kubescape/helm-charts at main, which can break this validation when
the required Helm template change only exists in the companion branch. Update
the checkout in the pr-created workflow to use the companion branch or pinned
SHA for the stack, or add gating so the integration step in the Extract
node-agent daemonset template / Build and run integration tests flow only runs
when that ref is available.
| Name: fmt.Sprintf("node-agent-%s", group.SanitizedName), | ||
| NodeGroupLabel: group.LabelValue, | ||
| NodeGroupLabelKey: tr.nodeGroupLabelKey, | ||
| IsDefaultGroup: group.IsDefault, |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect rendered/template selector usage around NodeGroupLabel.
rg -n -C5 '(\.NodeGroupLabel|kubescape\.io/node-group|matchLabels:|selector:)' \
--glob '!**/vendor/**' \
--glob '!**/node_modules/**' \
.Repository: kubescape/operator
Length of output: 20311
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Map the renderer and the nearby tests to inspect selector/name usage.
ast-grep outline nodeagentautoscaler/templaterenderer.go --view expanded
printf '\n--- templaterenderer.go (relevant slice) ---\n'
sed -n '1,320p' nodeagentautoscaler/templaterenderer.go | nl -ba | sed -n '180,260p'
printf '\n--- templaterenderer_test.go (selector slices) ---\n'
sed -n '1,460p' nodeagentautoscaler/templaterenderer_test.go | nl -ba | sed -n '60,230p'Repository: kubescape/operator
Length of output: 1671
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n--- nodeagentautoscaler/templaterenderer.go: RenderDaemonSet and GenerateDaemonSetName ---\n'
sed -n '221,320p' nodeagentautoscaler/templaterenderer.go
printf '\n--- nodeagentautoscaler/nodegrouper.go: grouping/default logic ---\n'
sed -n '80,150p' nodeagentautoscaler/nodegrouper.go
printf '\n--- nodeagentautoscaler/templaterenderer_test.go: relevant template assertions ---\n'
sed -n '60,240p' nodeagentautoscaler/templaterenderer_test.goRepository: kubescape/operator
Length of output: 10634
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find the configured template path and any real DaemonSet template files.
rg -n -C3 'TemplatePath|node-agent.*template|kubescape.io/node-group|app.kubernetes.io/name: node-agent|selector:' \
nodeagentautoscaler docs .github . \
--glob '!**/vendor/**' \
--glob '!**/node_modules/**'
printf '\n--- candidate files ---\n'
fd -a '.*template.*\.ya?ml$|.*daemonset.*\.ya?ml$|.*node-agent.*\.ya?ml$' .Repository: kubescape/operator
Length of output: 23281
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '40,80p' .github/workflows/pr-created.yaml | cat -nRepository: kubescape/operator
Length of output: 1061
🌐 Web query:
kubescape helm-charts node-agent autoscaler daemonset-template.yaml NodeGroupLabel selector Name kubescape.io/node-group
💡 Result:
In the Kubescape Helm charts, particularly within the kubescape-operator chart, the daemonset-template.yaml file serves as a shared template used for rendering Node Agent DaemonSets, especially when features like the operator-based autoscaler or multiple node-agent configurations are enabled [1][2]. Regarding the specific node label selector mentioned: The label kubescape.io/node-group is used as a dynamic placeholder within the daemonset-template.yaml [2]. Specifically, when the system renders the DaemonSet metadata, it includes: labels: kubescape.io/node-group: "{{{{ .NodeGroupLabel }}}}" This allows the Kubescape Operator to associate specific DaemonSet instances with defined node groups or node pools [2][3]. This is particularly relevant when using the nodeAgent.multipleDaemonSets feature, which enables administrators to deploy tailored Node Agent configurations (with specific resource requests/limits and scheduling constraints) to different node pools [3][4]. Key takeaways: - Shared Template: daemonset-template.yaml provides a single source of truth for the DaemonSet specification, used by multiple components within the operator to maintain consistency [2]. - Autoscaler Integration: The template includes logic to ensure that static DaemonSets and the automatic node-agent autoscaler do not conflict, with the template sometimes rendering in autoscaler-managed modes [1][2]. - Node Affinity: While the template handles the labeling for the object itself (using kubescape.io/node-group), the actual scheduling onto specific nodes is typically managed via nodeSelectors configured in the Helm values (nodeAgent.multipleDaemonSets.configurations[].nodeSelector) [3].
Citations:
- 1: Node-agent daemonset autoscaler support helm-charts#772
- 2: kubescape/helm-charts@kubescape-operator-1.30.0...kubescape-operator-1.30.1
- 3: https://kubescape.io/docs/operator/multiple-node-agent-per-node-pool/
- 4: https://kubescape.io/docs/install-operator/
🌐 Web query:
site:github.com/kubescape/helm-charts daemonset-template.yaml kubescape.io/node-group selector matchLabels app.kubernetes.io/name node-agent
💡 Result:
The daemonset-template.yaml file in the Kubescape Operator Helm chart is a template used specifically when the Kubescape Node Agent autoscaler mode is enabled [1][2]. This template allows the operator to dynamically manage Node Agent DaemonSets based on specific node groups [2]. The selector matchLabels within this context includes both standard identification labels and a specialized label for the autoscaler: 1. app.kubernetes.io/name: node-agent (typically included via the kubescape-operator.selectorLabels template helper) [2]. 2. kubescape.io/node-group: A dynamic label defined by the {{ .NodeGroupLabel }} placeholder, which maps to the configuration specified in the operator's values (e.g., nodeAgent.autoscaler.nodeGroupLabel) [2][3]. This configuration ensures that when the autoscaler is active, the generated DaemonSets are correctly identified and associated with the specific node groups they are intended to monitor [1][2]. In standard (non-autoscaler) deployments, the Node Agent is typically deployed as a static DaemonSet defined in node-agent/daemonset.yaml [1].
Citations:
- 1: Node-agent daemonset autoscaler support helm-charts#772
- 2: kubescape/helm-charts@kubescape-operator-1.30.0...kubescape-operator-1.30.1
- 3: kubescape/helm-charts@kubescape-operator-1.40.0...kubescape-operator-1.40.1
Use a unique selector label for each DaemonSet
kubescape.io/node-group can be shared by both the fallback default group and a real default group, so spec.selector.matchLabels/pod labels should use a unique discriminator like .Name instead. Add a collision test for the duplicated-default case.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nodeagentautoscaler/templaterenderer.go` around lines 226 - 229, The
DaemonSet labels in the node agent template currently use the shared node group
label, which can collide between the fallback default group and a real default
group. Update the DaemonSet selector and pod labels in templaterenderer.go,
especially where the template uses the group-derived name fields, so each
DaemonSet is uniquely identified by a discriminator like Name instead of
kubescape.io/node-group. Also add a test that covers the duplicated-default
scenario and verifies the selectors no longer collide.
matthyx
left a comment
There was a problem hiding this comment.
Re-review (with companion helm-charts #868 in mind)
Looked at #868 to resolve the cross-repo questions from my first pass. One correction and one sharpened blocker.
✅ Withdrawn — integration test "silently skips" concern
That's by design: the CI check renders the real chart template from helm-charts and the test consumes it, so it only goes green once #868 is available. Not a gap. #868 also keeps kubernetes.io/os: linux on the default group and keys the affinity off the configured label (.NodeGroupLabelKey), so the OS-selector and key-vs-value points from my first review are handled on the helm side — provided the operator supplies that field, which is the next item.
🔴 Blocker (sharpened) — this PR doesn't provide .NodeGroupLabelKey, so renders will fail once #868 merges
#868 renders {{ .NodeGroupLabelKey }} in both branches of the operator-consumed template:
{{- if not .IsDefaultGroup }}
{{ .NodeGroupLabelKey }}: "{{ .NodeGroupLabel }}" # normal nodeSelector
{{- end }}
...
{{- if .IsDefaultGroup }}
- key: {{ .NodeGroupLabelKey }} # DoesNotExist affinity
operator: DoesNotExist
{{- end }}
But TemplateData (nodeagentautoscaler/templaterenderer.go L20–33) only defines Name, NodeGroupLabel, IsDefaultGroup, Resources, GoMemLimit — there is no NodeGroupLabelKey. RenderDaemonSet builds the struct (L214–221) without it, and the renderer runs tr.template.Execute(&buf, data) (L237) with no missingkey leniency (and missingkey wouldn't help anyway — it only applies to maps, not struct field access).
So as soon as #868 merges and the operator renders the real template, Execute errors with can't evaluate field NodeGroupLabelKey in type templaterenderer.TemplateData — and this fails for every group (the normal nodeSelector branch references it too), not just the default group. No autoscaler-managed DaemonSet would render. This is exactly why the CI check waits on #868; with the PR as-is it would go red.
Fix: add NodeGroupLabelKey string to TemplateData and set it from the configured key in RenderDaemonSet, e.g. NodeGroupLabelKey: tr.config.NodeGroupLabel (or thread the key in via the constructor / NodeGrouper). The two PRs need to merge together with this field present.
🔴 Blocker (still stands) — default-group key collides with a real label value
Independent of helm. In nodeagentautoscaler/nodegrouper.go (~L81–115) label-less nodes bucket under groupMap[DefaultNodeGroup] ("default"), and group.IsDefault ||= isDefault flips a group to the default path if any member lacks the label. A node legitimately labeled node.kubernetes.io/instance-type=default lands in that same bucket — and #868's default branch drops the instance-type nodeSelector in favor of DoesNotExist, so that genuinely-labeled node is no longer matched and loses its node-agent (and resource sizing mixes the two populations). Track the synthetic fallback in a separate bucket that can't collide with a real label value.
|
Summary:
|
The integration-build job rendered the chart from a fresh helm-charts checkout, but the chart has local file:// subchart dependencies (kubescape-alert-crd, kubescape-alert-crd-ns) that must be vendored first. Run `helm dependency build` before `helm template` so the extraction produces a non-empty template instead of failing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
matthyx
left a comment
There was a problem hiding this comment.
Re-review — both blockers resolved ✅
Verified the two new commits (75ea4f5, a96a01c) against the blockers from the earlier review. Built clean and go test ./nodeagentautoscaler/... ./config/... passes locally.
🔴→✅ Default-group key collision — label-less nodes now bucket under an internal sentinel (defaultGroupKey = "\x00default-node-group", nodegrouper.go) that can't equal a real label value, so a node legitimately labelled …=default forms its own normal nodeSelector group and is never merged onto the affinity path. The collision-resolution hash is salted for the fallback group so the two same-LabelValue groups get distinct DaemonSet names. Good catch also keying Reconcile off the unique DaemonSet name instead of the (now non-unique) kubescape.io/node-group label value — that would otherwise have collapsed the default and real default groups during create/orphan handling.
🔴→✅ Missing NodeGroupLabelKey — TemplateData.NodeGroupLabelKey is now plumbed through NewTemplateRenderer(..., cfg.NodeGroupLabel) and set in RenderDaemonSet, so the template's {{ .NodeGroupLabelKey }} (both the nodeSelector and the default group's DoesNotExist affinity, per helm-charts #868) resolves and keys off the operator's actual grouping label even when overridden.
Withdrawn concern — the integration test now runs in CI (pr-created.yaml extracts the real chart template), with helm dependency build added in a96a01c to vendor the file:// subcharts.
One remaining gate (merge ordering, not a code issue)
The CI job checks out kubescape/helm-charts@main, but the matching template branch (.NodeGroupLabelKey + IsDefaultGroup) lives in unmerged helm-charts #868. Until #868 is on main, the extraction pulls the old template and TestIntegration_HelmGeneratedTemplate_DefaultGroup will fail (old template renders an instance-type nodeSelector and no DoesNotExist affinity). So #868 must merge first (or together) for this PR's CI to go green — exactly the dependency already noted. No further code changes needed on the operator side.
|
Summary:
|
|
Pod selector collision for the The PR carefully keeps the synthetic default group distinct from a real group labelled Node targeting is mutually exclusive (label-absent vs. Suggestion: derive the |
|
Default group resource sizing is pinned to the first-listed node — riskier for this group than for instance-type groups
For instance-type groups this is exact (homogeneous by definition). The default group is the opposite case: nodes land there because they're unlabeled, which on-prem/bare-metal is precisely the mixed-hardware scenario (e.g. a 2-core utility box and a 64-core worker in one bucket). Whichever node Consequence:
Since the silent-skip bug this PR fixes is itself an on-prem/heterogeneous-cluster scenario, consider sizing the default group off the minimum allocatable across its nodes (a safe lower bound) instead of the arbitrary first one. |
Address slashben's review on #389: the default (label-less) group can be heterogeneous (mixed on-prem/bare-metal hardware), so sizing it off the first-listed node is unsafe and order-dependent. Size it off the minimum allocatable CPU/memory across its nodes instead, so the node-agent's requests fit every node in the group regardless of node list order. Instance-type groups are homogeneous and keep first-node sizing. Also document the rare shared-selector edge case (a node legitimately labelled <groupingLabel>=<defaultNodeGroup>): targeting is mutually exclusive and ownerReferences prevent pod adoption, so it is safe; noted for operators. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
|
Thanks @slashben — both points addressed in Default-group sizing → minimum allocatable. Agreed; the default group is exactly the heterogeneous case. Pod-selector collision ( |
|
Summary:
|
Problem
The node-agent autoscaler groups nodes by
NodeGroupLabel(defaultnode.kubernetes.io/instance-type) and skipped any node missing that label. On on-prem / custom clusters where the label isn't populated, nonode-agentDaemonSet was created — the autoscaler effectively did nothing.Change
nodeAgentAutoscaler.defaultNodeGroup(default"default"; empty string preserves the strict skip behavior).NodeGroup.IsDefaultand exposed to the template asIsDefaultGroup.nodeSelectorcannot match an absent label, so the default group's DaemonSet is targeted with aDoesNotExistnode affinity instead. The matching template branch lives in the Helm chart (see companion PR).Companion PR
Requires the Helm chart template change: kubescape/helm-charts (DaemonSet template
IsDefaultGroupbranch).Tests
DoesNotExistaffinity + no instance-type selector.docs/node-agent-autoscaler.md).🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests