65 changes: 65 additions & 0 deletions docs/maintenance-operations/node-drain-coordination.md
@@ -0,0 +1,65 @@
---
title: "Kubernetes Node Drain Coordination"
description: "How the Simplyblock operator automatically protects storage availability during Kubernetes node maintenance such as cordon, drain, and rolling OS upgrades."
weight: 10800
---

When a Kubernetes worker node is cordoned or drained — for example during a rolling OS upgrade or node replacement —
the Simplyblock operator automatically coordinates the shutdown and restart of the backend storage node running on
that worker. No manual intervention is required.

Concurrency is controlled by `StorageCluster.spec.maxFaultTolerance`. At most that many workers may be inside the
active drain window at once, preventing the cluster from entering a degraded state during bulk maintenance.

## How It Works

When the operator detects that a worker node has become cordoned, it executes the following sequence:

1. Create a PodDisruptionBudget to prevent premature pod eviction.
2. Call the Simplyblock shutdown API for the backend storage node and wait until `offline`.
3. Relax the PDB to allow pod eviction — Kubernetes can now drain the worker.
4. Wait for the worker to return to a ready, uncordoned state.
5. Call the Simplyblock restart API and wait until `online` and cluster `rebalancing` is `false`.
6. Mark drain coordination `complete` and remove the PDB.
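
The PodDisruptionBudget from step 1 can be sketched as follows. This is illustrative only — the name, labels, and selector below are assumptions, and the actual objects the operator creates may differ:

```yaml title="Sketch of the eviction-blocking PDB (illustrative)"
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simplyblock-drain-guard        # hypothetical name
  namespace: simplyblock
spec:
  maxUnavailable: 0                    # step 1: block eviction entirely
  selector:
    matchLabels:
      app: simplyblock-storage-node    # hypothetical label
```

In step 3 the operator relaxes this budget (for example by raising `maxUnavailable`) so that Kubernetes can evict the pod and complete the drain.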

!!! warning
If another worker is already in the drain window and `maxFaultTolerance` would be exceeded, the operator holds
the new worker in the `detected` phase until an in-progress drain completes.

## Drain Phases

Each worker being drained progresses through the following phases, tracked in
`StorageNode.status.drainCoordination`:

| Phase | Description |
|-------------------|-----------------------------------------------------------------------------|
| `detected` | Worker is cordoned; waiting for a drain slot within `maxFaultTolerance`. |
| `shutdown_called` | Backend shutdown API has been called; waiting for `offline`. |
| `draining` | Shutdown confirmed; PDB relaxed — Kubernetes may evict pods. |
| `restart_called` | Worker is back; backend restart API has been called; waiting for `online`. |
| `complete` | Node is back online and cluster rebalancing has finished. |
| `failed` | An unrecoverable error occurred; manual intervention may be required. |
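
As a rough illustration, a worker midway through a drain might surface a status like the following. Only the phase values come from the table above; the other field names are assumptions, not taken from the CRD:

```yaml title="Hypothetical drainCoordination status shape"
status:
  drainCoordination:
    phase: draining                          # one of the phases above
    worker: worker-2                         # hypothetical field
    slotAcquiredAt: "2026-02-19T11:02:07Z"   # hypothetical field
```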

## Monitoring Drain State

```bash title="Inspect drain coordination status"
kubectl get storagenode simplyblock-node -n simplyblock \
-o jsonpath='{.status.drainCoordination}' | jq .
```
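
For scripting, the phase can be pulled out of a captured status document with `jq`. The sample JSON below is illustrative, assuming the status object carries a `phase` field as the table above suggests:

```shell title="Extract the drain phase from a captured status document"
# Sample status document (illustrative); in practice, capture it with the
# kubectl command above.
status='{"phase": "draining"}'

# Pull out the phase field with jq.
phase=$(echo "$status" | jq -r '.phase')
echo "drain phase: $phase"
```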

```bash title="Stream live changes"
kubectl get storagenode simplyblock-node -n simplyblock -w
```

## Configuring Fault Tolerance

Set `spec.maxFaultTolerance` on the `StorageCluster` resource to control how many workers can be simultaneously
inside the drain window:

```yaml title="Example: allow one worker in the drain window at a time"
spec:
maxFaultTolerance: 1
```
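
The same field can be set on a live cluster with a merge patch, in the style of the other examples in this documentation (a sketch — verify the resource name used in your deployment):

```shell title="Set maxFaultTolerance via kubectl patch"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
  --type=merge -p '{"spec": {"maxFaultTolerance": 1}}'
```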

A value of `1` is the safest default. Increase it only if your erasure coding scheme and replication factor can
tolerate multiple simultaneous node outages without data unavailability.
168 changes: 168 additions & 0 deletions docs/maintenance-operations/operator-cluster-operations.md
@@ -0,0 +1,168 @@
---
title: "Cluster and Node Operations via the Kubernetes Operator"
description: "How to perform lifecycle operations on a Simplyblock storage cluster and its nodes using the Kubernetes operator and Custom Resource Definitions."
weight: 10750
---

When Simplyblock is deployed on Kubernetes, cluster and node lifecycle operations are performed by patching the
`StorageCluster` and `StorageNode` Custom Resources rather than using the CLI directly. The operator picks up the
change, calls the backend API, polls for the expected terminal state, and records the result in `.status.actionStatus`.

!!! info
For CLI-based node operations on non-Kubernetes deployments, see
[Stopping and Manually Restarting a Storage Node](manual-restarting-nodes.md).

## StorageCluster Actions

Trigger a cluster-wide action by patching `spec.action` on the `StorageCluster` resource. Only one action runs at
a time. The operator sets `.status.actionStatus.state` to `running` while the action is in progress and to
`success` or `failed` when it completes.
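
A script waiting on an action can branch on that state. The JSON below is an illustrative captured document, assuming the `state` and `message` fields described in this section:

```shell title="Branch on a captured actionStatus document"
# Sample actionStatus (illustrative); in practice, capture it with
# kubectl get ... -o jsonpath='{.status.actionStatus}'.
action_status='{"state": "success", "message": "cluster active"}'

# Decide whether the action is still running or has finished.
state=$(echo "$action_status" | jq -r '.state')
case "$state" in
  running)         echo "action in progress" ;;
  success|failed)  echo "action finished: $state" ;;
esac
```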

### Shutdown

```bash title="Shut down the storage cluster"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "shutdown"}}'
```

The operator calls the backend shutdown API and polls until the cluster reports `suspended`.

### Start

```bash title="Start a suspended storage cluster"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "start"}}'
```

The operator calls the backend start API and polls until the cluster reports `active`.

### Restart

```bash title="Restart the storage cluster"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "restart"}}'
```

Runs shutdown → waits for `suspended` → runs start → waits for `active`. The current sub-phase is stored in
`.status.actionStatus.message`.

### Activate

```bash title="Activate a newly created cluster"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "activate"}}'
```

The operator calls the backend activate API and waits until the cluster reports `active`.

### Expand

```bash title="Finalize a cluster expansion"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "expand"}}'
```

The operator calls the backend expand API and waits until the cluster returns to `active`.

!!! info
To add new worker nodes to the storage fabric first, see
[Expanding a Storage Cluster](scaling/expanding-storage-cluster.md).

### Node Recycle

Node recycle sequentially restarts every backend storage node in the cluster. Use it after updating the storage-node
container image or changing node configuration.

```bash title="Recycle all storage nodes"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "node-recycle"}}'
```

To also refresh the storage-node DaemonSet pod on each worker after shutdown and before restart — for example when
rolling out a new container image — add `nodeRecycle.refreshSNodeAPI: true`:

```bash title="Recycle all storage nodes and refresh DaemonSet pods"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "node-recycle", "nodeRecycle": {"refreshSNodeAPI": true}}}'
```

For each backend storage node the operator executes:

1. Shut down the node and wait until `offline` or `in_restart`.
2. If `refreshSNodeAPI: true`, restart the DaemonSet pod and wait for the storage-node API to become reachable.
3. Restart the node and wait until `online`.
4. Wait until cluster `rebalancing` is `false`.
5. Proceed to the next node.
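
Step 4 above can be sketched as a gate on the backend's `rebalancing` flag, read here from a captured status document (the sample JSON is illustrative):

```shell title="Gate the next recycle step on the rebalancing flag"
# Sample cluster status (illustrative); in practice, capture it from the
# backend or the StorageCluster status.
cluster_status='{"status": "active", "rebalancing": false}'

# Only proceed once rebalancing has finished.
rebalancing=$(echo "$cluster_status" | jq -r '.rebalancing')
if [ "$rebalancing" = "false" ]; then
  echo "safe to proceed to the next node"
fi
```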

Progress is tracked in `.status.actionStatus` and `.status.nodeRecycleStatus`:

```bash title="Watch node recycle progress"
kubectl get storagecluster simplyblock-cluster -n simplyblock \
-o jsonpath='{.status.nodeRecycleStatus}' | jq .
```

## StorageNode Actions

Direct operations on individual backend storage nodes are triggered by patching `spec.action` and `spec.nodeUUID`
on the `StorageNode` resource. Both fields are required together — CRD validation rejects an `action` without a
`nodeUUID`.
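
A CEL rule of the kind a CRD can use to enforce this pairing looks roughly like the following. This is an assumed sketch, not the operator's actual CRD source:

```yaml title="Illustrative CRD validation rule for the action/nodeUUID pairing"
x-kubernetes-validations:
  - rule: "!has(self.action) || has(self.nodeUUID)"
    message: "spec.nodeUUID is required when spec.action is set"
```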

```bash title="Restart a specific storage node"
kubectl patch storagenode simplyblock-node -n simplyblock \
--type=merge -p '{
"spec": {
"action": "restart",
"nodeUUID": "<node-uuid>"
}
}'
```

After the action completes, clear `spec.action` and `spec.nodeUUID` from the CR — the operator does not clear them
automatically.
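
A JSON merge patch with `null` values removes both keys in one step (standard merge-patch semantics; the resource name follows the examples above):

```shell title="Clear action and nodeUUID after completion"
kubectl patch storagenode simplyblock-node -n simplyblock \
  --type=merge -p '{"spec": {"action": null, "nodeUUID": null}}'
```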

### Supported Actions and Terminal States

| Action | Expected backend state after success |
|------------|------------------------------------------------|
| `shutdown` | `offline` |
| `restart` | `online` |
| `suspend` | `suspended` |
| `resume` | `online` |
| `remove` | node no longer present; `404` treated as success |

### Restart with Worker Relocation

For a `restart` action, the following additional fields are available:

| Field | Type | Description |
|------------------|------|-------------|
| `workerNode` | string | Kubernetes worker to restart the node on. The operator labels the worker and waits for the storage-node API to become reachable before triggering restart. |
| `reattachVolume` | bool | Reattach volumes during restart where the backend supports it. |
| `force` | bool | Force the action where supported by the backend. |
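
Combining these with a restart might look like the following sketch; `worker-2` is a placeholder worker name:

```shell title="Restart a storage node on a specific worker (sketch)"
kubectl patch storagenode simplyblock-node -n simplyblock \
  --type=merge -p '{
    "spec": {
      "action": "restart",
      "nodeUUID": "<node-uuid>",
      "workerNode": "worker-2"
    }
  }'
```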

## Monitoring Action Progress

### Watch cluster action state

```bash title="Get current action status"
kubectl get storagecluster simplyblock-cluster -n simplyblock \
-o jsonpath='{.status.actionStatus}' | jq .
```

```bash title="Stream live status changes"
kubectl get storagecluster simplyblock-cluster -n simplyblock -w
```

### Read backend cluster status

```bash title="Get backend lifecycle status"
kubectl get storagecluster simplyblock-cluster -n simplyblock \
-o jsonpath='{.status.status}{"\n"}'
```

### Inspect individual node states

```bash title="Get all storage node states"
kubectl get storagenode simplyblock-node -n simplyblock \
-o jsonpath='{.status.nodes}' | jq .
```
30 changes: 30 additions & 0 deletions docs/maintenance-operations/scaling/expanding-storage-cluster.md
@@ -31,6 +31,36 @@ Once all newly added nodes are healthy/ready, finalize the expansion:

After the expansion is complete, the cluster returns to **ACTIVE** and resumes normal operation mode.

## Adding Worker Nodes with the Kubernetes Operator

When running Simplyblock on Kubernetes, add new worker nodes to the storage fabric by appending them to
`StorageNode.spec.workerNodes`:

```bash title="Add worker nodes via the operator"
kubectl patch storagenode simplyblock-node -n simplyblock \
--type=json -p '[
{"op":"add","path":"/spec/workerNodes/-","value":"new-node-4"},
{"op":"add","path":"/spec/workerNodes/-","value":"new-node-5"}
]'
```
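
After the patch, the list contains the new entries alongside the workers already in the fabric. The existing names below are placeholders:

```yaml title="Resulting workerNodes list (sketch)"
spec:
  workerNodes:
    - existing-node-1   # placeholder for a worker already in the fabric
    - existing-node-2
    - existing-node-3
    - new-node-4
    - new-node-5
```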

The operator deploys the storage-node DaemonSet to the new workers, registers them with the Simplyblock backend,
and waits for each node to come online. The backend transitions to **IN_EXPANSION** during this process.

Once the nodes are online, finalize the expansion using the `StorageCluster` action:

```bash title="Finalize expansion via the operator"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "expand"}}'
```

Monitor progress:

```bash title="Watch expansion status"
kubectl get storagecluster simplyblock-cluster -n simplyblock \
-o jsonpath='{.status.status}{"\n"}' -w
```

```plain title="Example output for finalizing cluster expansion"
[demo@demo ~]# {{ cliname }} cluster complete-expand e2cda3fe-e9f2-42ce-bb2d-eecd10f58ccf
2026-02-19 11:28:49,995: 139892426475328: INFO: Connecting to remote_jm_af8d10c1-6613-47a9-8ed0-ebdf1f873738