65 changes: 65 additions & 0 deletions docs/maintenance-operations/node-drain-coordination.md
@@ -0,0 +1,65 @@
---
title: "Kubernetes Node Drain Coordination"
description: "How the Simplyblock operator automatically protects storage availability during Kubernetes node maintenance such as cordon, drain, and rolling OS upgrades."
weight: 10800
---

When a Kubernetes worker node is cordoned or drained — for example during a rolling OS upgrade or node replacement —
the Simplyblock operator automatically coordinates the shutdown and restart of the backend storage node running on
that worker. No manual intervention is required.

Concurrency is controlled by `StorageCluster.spec.maxFaultTolerance`. At most that many workers may be inside the
active drain window at once, preventing the cluster from entering a degraded state during bulk maintenance.

## How It Works

When the operator detects that a worker node has become cordoned, it executes the following sequence:

1. Create a PodDisruptionBudget to prevent premature pod eviction.
2. Call the Simplyblock shutdown API for the backend storage node and wait until `offline`.
3. Relax the PDB to allow pod eviction — Kubernetes can now drain the worker.
4. Wait for the worker to return to a ready, uncordoned state.
5. Call the Simplyblock restart API and wait until `online` and cluster `rebalancing` is `false`.
6. Mark drain coordination `complete` and remove the PDB.
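
The PodDisruptionBudget from step 1 can be sketched as follows. This is illustrative only — the name, labels, and selector below are assumptions, and the actual objects the operator creates may differ:

```yaml title="Sketch of the eviction-blocking PDB (illustrative)"
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: simplyblock-drain-guard        # hypothetical name
  namespace: simplyblock
spec:
  maxUnavailable: 0                    # step 1: block eviction entirely
  selector:
    matchLabels:
      app: simplyblock-storage-node    # hypothetical label
```

In step 3 the operator relaxes this budget (for example by raising `maxUnavailable`) so that Kubernetes can evict the pod and complete the drain.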

!!! warning
If another worker is already in the drain window and `maxFaultTolerance` would be exceeded, the operator holds
the new worker in the `detected` phase until an in-progress drain completes.

## Drain Phases

Each worker being drained progresses through the following phases, tracked in
`StorageNode.status.drainCoordination`:

| Phase | Description |
|-------------------|-----------------------------------------------------------------------------|
| `detected` | Worker is cordoned; waiting for a drain slot within `maxFaultTolerance`. |
| `shutdown_called` | Backend shutdown API has been called; waiting for `offline`. |
| `draining` | Shutdown confirmed; PDB relaxed — Kubernetes may evict pods. |
| `restart_called` | Worker is back; backend restart API has been called; waiting for `online`. |
| `complete` | Node is back online and cluster rebalancing has finished. |
| `failed` | An unrecoverable error occurred; manual intervention may be required. |
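
As a rough illustration, a worker midway through a drain might surface a status like the following. Only the phase values come from the table above; the other field names are assumptions, not taken from the CRD:

```yaml title="Hypothetical drainCoordination status shape"
status:
  drainCoordination:
    phase: draining                          # one of the phases above
    worker: worker-2                         # hypothetical field
    slotAcquiredAt: "2026-02-19T11:02:07Z"   # hypothetical field
```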

## Monitoring Drain State

```bash title="Inspect drain coordination status"
kubectl get storagenode simplyblock-node -n simplyblock \
-o jsonpath='{.status.drainCoordination}' | jq .
```
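
For scripting, the phase can be pulled out of a captured status document with `jq`. The sample JSON below is illustrative, assuming the status object carries a `phase` field as the table above suggests:

```shell title="Extract the drain phase from a captured status document"
# Sample status document (illustrative); in practice, capture it with the
# kubectl command above.
status='{"phase": "draining"}'

# Pull out the phase field with jq.
phase=$(echo "$status" | jq -r '.phase')
echo "drain phase: $phase"
```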

```bash title="Stream live changes"
kubectl get storagenode simplyblock-node -n simplyblock -w
```

## Configuring Fault Tolerance

Set `spec.maxFaultTolerance` on the `StorageCluster` resource to control how many workers can be simultaneously
inside the drain window:

```yaml title="Example: allow one worker in the drain window at a time"
spec:
maxFaultTolerance: 1
```
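
The same field can be set on a live cluster with a merge patch, in the style of the other examples in this documentation (a sketch — verify the resource name used in your deployment):

```shell title="Set maxFaultTolerance via kubectl patch"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
  --type=merge -p '{"spec": {"maxFaultTolerance": 1}}'
```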

A value of `1` is the safest default. Increase it only if your erasure coding scheme and replication factor can
tolerate multiple simultaneous node outages without data unavailability.
168 changes: 168 additions & 0 deletions docs/maintenance-operations/operator-cluster-operations.md
@@ -0,0 +1,168 @@
---
title: "Cluster and Node Operations via the Kubernetes Operator"
description: "How to perform lifecycle operations on a Simplyblock storage cluster and its nodes using the Kubernetes operator and Custom Resource Definitions."
weight: 10750
---

When Simplyblock is deployed on Kubernetes, cluster and node lifecycle operations are performed by patching the
`StorageCluster` and `StorageNode` Custom Resources rather than using the CLI directly. The operator picks up the
change, calls the backend API, polls for the expected terminal state, and records the result in `.status.actionStatus`.

!!! info
For CLI-based node operations on non-Kubernetes deployments, see
[Stopping and Manually Restarting a Storage Node](manual-restarting-nodes.md).

## StorageCluster Actions

Trigger a cluster-wide action by patching `spec.action` on the `StorageCluster` resource. Only one action runs at
a time. The operator sets `.status.actionStatus.state` to `running` while the action is in progress and to
`success` or `failed` when it completes.
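
A script waiting on an action can branch on that state. The JSON below is an illustrative captured document, assuming the `state` and `message` fields described in this section:

```shell title="Branch on a captured actionStatus document"
# Sample actionStatus (illustrative); in practice, capture it with
# kubectl get ... -o jsonpath='{.status.actionStatus}'.
action_status='{"state": "success", "message": "cluster active"}'

# Decide whether the action is still running or has finished.
state=$(echo "$action_status" | jq -r '.state')
case "$state" in
  running)         echo "action in progress" ;;
  success|failed)  echo "action finished: $state" ;;
esac
```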

### Shutdown

```bash title="Shut down the storage cluster"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "shutdown"}}'
```

The operator calls the backend shutdown API and polls until the cluster reports `suspended`.

### Start

```bash title="Start a suspended storage cluster"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "start"}}'
```

The operator calls the backend start API and polls until the cluster reports `active`.

### Restart

```bash title="Restart the storage cluster"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "restart"}}'
```

Runs shutdown → waits for `suspended` → runs start → waits for `active`. The current sub-phase is stored in
`.status.actionStatus.message`.

### Activate

```bash title="Activate a newly created cluster"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "activate"}}'
```

The operator calls the backend activate API and waits until the cluster reports `active`.

### Expand

```bash title="Finalize a cluster expansion"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "expand"}}'
```

The operator calls the backend expand API and waits until the cluster returns to `active`.

!!! info
To add new worker nodes to the storage fabric first, see
[Expanding a Storage Cluster](scaling/expanding-storage-cluster.md).

### Node Recycle

Node recycle sequentially restarts every backend storage node in the cluster. Use it after updating the storage-node
container image or changing node configuration.

```bash title="Recycle all storage nodes"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "node-recycle"}}'
```

To also refresh the storage-node DaemonSet pod on each worker after shutdown and before restart — for example when
rolling out a new container image — add `nodeRecycle.refreshSNodeAPI: true`:

```bash title="Recycle all storage nodes and refresh DaemonSet pods"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "node-recycle", "nodeRecycle": {"refreshSNodeAPI": true}}}'
```

For each backend storage node the operator executes:

1. Shut down the node and wait until `offline` or `in_restart`.
2. If `refreshSNodeAPI: true`, restart the DaemonSet pod and wait for the storage-node API to become reachable.
3. Restart the node and wait until `online`.
4. Wait until cluster `rebalancing` is `false`.
5. Proceed to the next node.
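
Step 4 above can be sketched as a gate on the backend's `rebalancing` flag, read here from a captured status document (the sample JSON is illustrative):

```shell title="Gate the next recycle step on the rebalancing flag"
# Sample cluster status (illustrative); in practice, capture it from the
# backend or the StorageCluster status.
cluster_status='{"status": "active", "rebalancing": false}'

# Only proceed once rebalancing has finished.
rebalancing=$(echo "$cluster_status" | jq -r '.rebalancing')
if [ "$rebalancing" = "false" ]; then
  echo "safe to proceed to the next node"
fi
```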

Progress is tracked in `.status.actionStatus` and `.status.nodeRecycleStatus`:

```bash title="Watch node recycle progress"
kubectl get storagecluster simplyblock-cluster -n simplyblock \
-o jsonpath='{.status.nodeRecycleStatus}' | jq .
```

## StorageNode Actions

Direct operations on individual backend storage nodes are triggered by patching `spec.action` and `spec.nodeUUID`
on the `StorageNode` resource. Both fields are required together — CRD validation rejects an `action` without a
`nodeUUID`.
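
A CEL rule of the kind a CRD can use to enforce this pairing looks roughly like the following. This is an assumed sketch, not the operator's actual CRD source:

```yaml title="Illustrative CRD validation rule for the action/nodeUUID pairing"
x-kubernetes-validations:
  - rule: "!has(self.action) || has(self.nodeUUID)"
    message: "spec.nodeUUID is required when spec.action is set"
```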

```bash title="Restart a specific storage node"
kubectl patch storagenode simplyblock-node -n simplyblock \
--type=merge -p '{
"spec": {
"action": "restart",
"nodeUUID": "<node-uuid>"
}
}'
```

After the action completes, clear `spec.action` and `spec.nodeUUID` from the CR — the operator does not clear them
automatically.
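
A JSON merge patch with `null` values removes both keys in one step (standard merge-patch semantics; the resource name follows the examples above):

```shell title="Clear action and nodeUUID after completion"
kubectl patch storagenode simplyblock-node -n simplyblock \
  --type=merge -p '{"spec": {"action": null, "nodeUUID": null}}'
```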

### Supported Actions and Terminal States

| Action | Expected backend state after success |
|------------|------------------------------------------------|
| `shutdown` | `offline` |
| `restart` | `online` |
| `suspend` | `suspended` |
| `resume` | `online` |
| `remove` | node no longer present; `404` treated as success |

### Restart with Worker Relocation

For a `restart` action, the following additional fields are available:

| Field | Type | Description |
|------------------|------|-------------|
| `workerNode` | string | Kubernetes worker to restart the node on. The operator labels the worker and waits for the storage-node API to become reachable before triggering restart. |
| `reattachVolume` | bool | Reattach volumes during restart where the backend supports it. |
| `force` | bool | Force the action where supported by the backend. |
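
Combining these with a restart might look like the following sketch; `worker-2` is a placeholder worker name:

```shell title="Restart a storage node on a specific worker (sketch)"
kubectl patch storagenode simplyblock-node -n simplyblock \
  --type=merge -p '{
    "spec": {
      "action": "restart",
      "nodeUUID": "<node-uuid>",
      "workerNode": "worker-2"
    }
  }'
```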

## Monitoring Action Progress

### Watch cluster action state

```bash title="Get current action status"
kubectl get storagecluster simplyblock-cluster -n simplyblock \
-o jsonpath='{.status.actionStatus}' | jq .
```

```bash title="Stream live status changes"
kubectl get storagecluster simplyblock-cluster -n simplyblock -w
```

### Read backend cluster status

```bash title="Get backend lifecycle status"
kubectl get storagecluster simplyblock-cluster -n simplyblock \
-o jsonpath='{.status.status}{"\n"}'
```

### Inspect individual node states

```bash title="Get all storage node states"
kubectl get storagenode simplyblock-node -n simplyblock \
-o jsonpath='{.status.nodes}' | jq .
```
30 changes: 30 additions & 0 deletions docs/maintenance-operations/scaling/expanding-storage-cluster.md
@@ -31,6 +31,36 @@ Once all newly added nodes are healthy/ready, finalize the expansion:

After the expansion is complete, the cluster returns to **ACTIVE** and resumes normal operation mode.

## Adding Worker Nodes with the Kubernetes Operator

When running Simplyblock on Kubernetes, add new worker nodes to the storage fabric by appending them to
`StorageNode.spec.workerNodes`:

```bash title="Add worker nodes via the operator"
kubectl patch storagenode simplyblock-node -n simplyblock \
--type=json -p '[
{"op":"add","path":"/spec/workerNodes/-","value":"new-node-4"},
{"op":"add","path":"/spec/workerNodes/-","value":"new-node-5"}
]'
```
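
After the patch, the list contains the new entries alongside the workers already in the fabric. The existing names below are placeholders:

```yaml title="Resulting workerNodes list (sketch)"
spec:
  workerNodes:
    - existing-node-1   # placeholder for a worker already in the fabric
    - existing-node-2
    - existing-node-3
    - new-node-4
    - new-node-5
```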

The operator deploys the storage-node DaemonSet to the new workers, registers them with the Simplyblock backend,
and waits for each node to come online. The backend transitions to **IN_EXPANSION** during this process.

Once the nodes are online, finalize the expansion using the `StorageCluster` action:

```bash title="Finalize expansion via the operator"
kubectl patch storagecluster simplyblock-cluster -n simplyblock \
--type=merge -p '{"spec": {"action": "expand"}}'
```

Monitor progress:

```bash title="Watch expansion status"
kubectl get storagecluster simplyblock-cluster -n simplyblock \
-o jsonpath='{.status.status}{"\n"}' -w
```

```plain title="Example output for finalizing cluster expansion"
[demo@demo ~]# {{ cliname }} cluster complete-expand e2cda3fe-e9f2-42ce-bb2d-eecd10f58ccf
2026-02-19 11:28:49,995: 139892426475328: INFO: Connecting to remote_jm_af8d10c1-6613-47a9-8ed0-ebdf1f873738