Skip to content

projectbeskar/virtrigaud

Repository files navigation

VirtRigaud

A Kubernetes operator for managing virtual machines across multiple hypervisors.

Version: v0.3.11 — CHANGELOG | Documentation

Overview

VirtRigaud is a Kubernetes operator that enables declarative management of virtual machines across different hypervisor platforms. It provides a unified API for provisioning and managing VMs on vSphere, Libvirt/KVM, and Proxmox VE through a remote gRPC provider architecture.

The manager reconciles Kubernetes custom resources; each hypervisor runs as a separate provider pod. Manager and provider pods communicate over gRPC. Credentials are scoped to the Provider CR and never flow through the manager.

Features

  • Multi-Hypervisor Support: vSphere, Libvirt/KVM, and Proxmox VE simultaneously
  • Cross-Provider VM Migration: Storage-backend-agnostic disk migration between any two hypervisors — S3 and NFS staging backends, validated across vSphere ⇄ Libvirt/KVM ⇄ Proxmox VE in both directions (ADR-0006). The disk is staged through object storage or an NFS export and moved with qemu-img; it never traverses a CSI PVC (PVC is compat-only). NFS uses qemu-img's native libnfs transport (kernel-mount on Proxmox). (v0.3.11)
  • VM Cloning (VMClone): Full and linked clones, MVP — source.vmRef, same-provider (vSphere/Proxmox/Libvirt; libvirt: qcow2 overlay for linked, full copy for full)
  • VMSet CRD defined; controller not yet active: Multi-VM replica set is defined but the controller is a stub that reports Ready=False / ControllerNotImplemented; rolling updates and replica management are roadmap
  • VMPlacementPolicy (reference-only): Placement rules (affinity, anti-affinity, resource constraints) expressed as a policy object referenced by VirtualMachine.spec.placementRef; no standalone enforcement controller
  • Declarative v1beta1 API: Stable CRDs with OpenAPI validation
  • Cloud-Init Support: Cross-provider VM initialisation via cloud-init
  • Power Management: On/Off/Reboot/Graceful-Shutdown uniformly
  • Async Task Tracking: Long-running vSphere and Proxmox operations tracked via TaskStatus RPC
  • Resource Reconfiguration: CPU, memory, disk changes (online for vSphere/Proxmox; online for Libvirt when VM was created with cpuHotAddEnabled/memoryHotAddEnabled, otherwise power-cycle)
  • G6 Circuit Breaker: One circuit breaker per Provider CR for automatic failure isolation (v0.3.6+)
  • Secure-by-default gRPC: mTLS wired end-to-end (TLS 1.3, SNI, certwatcher hot-reload); provider pods fail closed without credentials (#147/#148, v0.3.7)
  • Libvirt SSH host-key verification: known_hosts enforced by default; TOFU removed (#149, v0.3.7)
  • Observability: 11 virtrigaud_* Prometheus metric families (1 deprecated in v0.3.6; removal in v0.4.0)

Architecture

VirtRigaud uses a Remote Provider architecture for optimal scalability and reliability:

graph TB
    %% Kubernetes Cluster boundary
    subgraph "Kubernetes Cluster"

        %% CRDs
        subgraph "Custom Resources (v1beta1)"
            VM[VirtualMachine]
            VMC[VMClass]
            VMI[VMImage]
            PR[Provider]
            VMNA[VMNetworkAttachment]
            VMSN[VMSnapshot]
            VMSET[VMSet]
            VMMig[VMMigration]
            VMPP[VMPlacementPolicy]
            VMCL[VMClone]
        end

        %% Controller
        CTRL["VirtRigaud Manager
        (controller + G6 CB interceptor)"]

        %% Remote Providers
        subgraph "Remote Providers (gRPC)"
            VSP[vSphere Provider Pod]
            LVP[Libvirt Provider Pod]
            PXP[Proxmox Provider Pod]
        end

        %% Connections within cluster
        VM -.-> CTRL
        VMC -.-> CTRL
        VMI -.-> CTRL
        PR -.-> CTRL
        VMNA -.-> CTRL

        CTRL -->|"gRPC (G4 retry + G6 CB)"| VSP
        CTRL -->|"gRPC (G4 retry + G6 CB)"| LVP
        CTRL -->|"gRPC (G4 retry + G6 CB)"| PXP
    end

    %% External Infrastructure
    subgraph "External Infrastructure"
        subgraph "vSphere"
            VCENTER[vCenter Server]
        end
        subgraph "KVM"
            LIBVIRT[Libvirt Host]
        end
        subgraph "Proxmox VE"
            PVE[Proxmox Cluster]
        end
    end

    VSP -->|govmomi API| VCENTER
    LVP -->|libvirt+SSH| LIBVIRT
    PXP -->|REST API| PVE
Loading

Security status (v0.3.11)

The following issues were open in v0.3.6 and resolved in v0.3.7; they are closed in this release:

  • mTLS wired end-to-end (#147, v0.3.7): manager↔provider gRPC TLS is wired through the provider Resolver with cert/key/CA loaded, TLS 1.3, SNI, and certwatcher hot-reload. Provider servers require and verify client certificates. Exception: the libvirt provider uses plaintext gRPC to its sidecar container and separately enforces SSH known_hosts — this is a documented maintainer choice, not a defect.
  • Provider gRPC auth enforced, fail-closed (#148, v0.3.7): provider pods require TLS credentials and fail closed (crash-loop) at startup if credentials are absent, unless explicitly opted into insecure mode via the provider runtime config.
  • Libvirt SSH host-key verification ON by default (#149, v0.3.7): the no_verify=1 flag is removed; known_hosts is sourced from the credentials Secret. Trust-on-first-use (TOFU) is no longer the default.

Verify these controls are correctly configured before relying on them in regulated environments. For full security guidance, see the Security Operations Guide.

CRDs (10 total, all v1beta1)

CRD Short name Controller Description
VirtualMachine vm active A virtual machine instance
VMClass vmc active Resource profile (CPU, memory, disk)
VMImage vmi active Base template or image reference
VMNetworkAttachment vmna active Network configuration
Provider prov active Hypervisor connection + runtime config
VMMigration vmmig active Cross-provider VM migration
VMSnapshot active Snapshot lifecycle management
VMClone vmclone active (MVP) Cloning operations — MVP: source.vmRef source, same-provider, full & linked clones
VMSet vmset not yet active Multi-VM replica set — controller is a stub that reports Ready=False / ControllerNotImplemented
VMPlacementPolicy reference-only Placement rules (affinity, resources) — a policy object referenced by VirtualMachine.spec.placementRef; no standalone controller

Note: VMAdoption is a controller built into the manager, not a CRD.

Provider Feature Matrix

Per the canonical capabilities matrix, verified against provider GetCapabilities responses (v0.3.11: the NFS migration staging backend is implemented across all three providers, both directions — alongside the S3 backend; v0.3.9 added Libvirt Clone, ImagePrepare, online disk expansion, online reconfigure, and memory snapshots):

Feature vSphere Libvirt Proxmox Notes
Core Operations Create/Delete/Power/Describe
Reconfiguration Libvirt: online via setvcpus/setmem --live when VM was created with cpuHotAddEnabled/memoryHotAddEnabled (hotplug headroom provisioned at create, grows up to ~4× ceiling, vCPU hard cap 64); otherwise power-cycle (#203)
Disk Expansion Libvirt: online grow via virsh blockresize (grow-only; desired ≤ current is a no-op) + best-effort in-guest FS grow via guest agent (#201)
Snapshots Point-in-time captures
Memory Snapshots RAM-inclusive checkpoints for a running VM. vSphere: CreateSnapshot(memory=true). Libvirt: snapshot-create-as without --disk-only; a stopped VM is honestly downgraded to disk-only with a WARN (#202).
Cloning (full) Libvirt: full copy of resolved disk path (qemu-img convert / vol-clone), same-provider (#153)
Linked Clones Libvirt: qcow2 overlay (backing-file COW), same-provider (#153). UEFI/secure-boot nvram re-point is a deferred follow-up (#208).
Clone RPC Libvirt Clone implemented: linked (qcow2 overlay) + full copy, source.vmRef, same-provider (#153)
ImagePrepare RPC Libvirt: import/convert image into a storage pool (#154)
Task Tracking N/A Async operation monitoring
Console URLs ⚠️ Proxmox console URL: planned
Guest Agent IP detection and guest info
Image Import Libvirt: import into storage pool (#154). vSphere: OVA/content library.
Multi-NIC Multiple network interfaces
Circuit Breaker One CB per Provider CR (v0.3.6)
Cross-Provider Migration S3 + NFS staging backends, both directions, all pairs (ADR-0006, #236). vSphere stages pod-side; libvirt host-side; Proxmox node-side over SSH (NFS via kernel mount). PVC is compat-only.

Quick Start

Prerequisites

  • Kubernetes 1.25+
  • Helm 3.10+
  • Go 1.26+ (for source builds only)

Installation via Helm (Recommended)

  1. Add the Helm repository:

    helm repo add virtrigaud https://projectbeskar.github.io/virtrigaud
    helm repo update
  2. Install VirtRigaud (version 0.3.8):

    helm install virtrigaud virtrigaud/virtrigaud \
      --version 0.3.11 \
      -n virtrigaud-system --create-namespace

    CRDs are installed automatically via Helm hooks. To disable automatic CRD upgrades:

    helm install virtrigaud virtrigaud/virtrigaud \
      --version 0.3.11 \
      -n virtrigaud-system --create-namespace \
      --set crdUpgrade.enabled=false

    Providers are NOT enabled via Helm flags. Create Provider CRs (step 1 below) — the controller deploys provider pods automatically.

  3. Verify the installation:

    kubectl get pods -n virtrigaud-system
    kubectl get crd | grep virtrigaud
  4. Upgrade:

    helm upgrade virtrigaud virtrigaud/virtrigaud \
      --version 0.3.11 \
      -n virtrigaud-system

Development Installation

# Install CRDs
make install

# Run the controller locally
make run

Go 1.26+ is required for source builds.

Using VirtRigaud

  1. Create credentials secrets:

    # Libvirt — SSH key (recommended)
    kubectl create secret generic libvirt-creds -n default \
      --from-literal=username=your-ssh-username \
      --from-file=ssh-privatekey=~/.ssh/id_rsa
    
    # Libvirt — password
    kubectl create secret generic libvirt-creds -n default \
      --from-literal=username=your-ssh-username \
      --from-literal=password='your-ssh-password'
    
    # vSphere
    kubectl create secret generic vsphere-creds -n default \
      --from-literal=username=administrator@vsphere.local \
      --from-literal=password='your-password'
    
    # Proxmox VE — API token (recommended; keys: token_id, token_secret)
    kubectl create secret generic proxmox-creds -n default \
      --from-literal=token_id='virtrigaud@pve!vrtg-token' \
      --from-literal=token_secret='xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx'

    The Proxmox provider reads credentials from files mounted at /etc/virtrigaud/credentials/{token_id,token_secret,username,password}. Do NOT use envFrom: secretRef for Proxmox credentials — that pattern is not implemented.

  2. Create a Provider CR:

    # Libvirt/KVM
    apiVersion: infra.virtrigaud.io/v1beta1
    kind: Provider
    metadata:
      name: libvirt-kvm
      namespace: default
    spec:
      type: libvirt
      endpoint: "qemu+ssh://192.168.1.10/system"
      credentialSecretRef:
        name: libvirt-creds
      runtime:
        image: "ghcr.io/projectbeskar/virtrigaud/provider-libvirt:v0.3.11"
        service:
          port: 9443
    # vSphere
    apiVersion: infra.virtrigaud.io/v1beta1
    kind: Provider
    metadata:
      name: vsphere-datacenter
      namespace: default
    spec:
      type: vsphere
      endpoint: "https://vcenter.example.com:443"
      credentialSecretRef:
        name: vsphere-creds
      runtime:
        image: "ghcr.io/projectbeskar/virtrigaud/provider-vsphere:v0.3.11"
        service:
          port: 9443

    When you apply a Provider CR, the controller creates a dedicated Deployment and Service for the provider pod in the same namespace. Each Provider CR has isolated credentials.

  3. Deploy a VM:

    kubectl apply -f examples/vm-ubuntu-small.yaml
    kubectl get virtualmachine -w

    See examples/ for more examples.

VM Migration

VirtRigaud migrates VMs between providers by staging the disk on a storage-agnostic backend — S3-compatible object storage or an NFS export (ADR-0006). The disk never traverses a CSI PVC; the source exports its native disk format and the target converts on import.

  • S3: the provider pod is the S3 client, so the bytes flow host → pod → S3 → pod → host (the universal relay path).
  • NFS: the disk is staged on an NFS export and moved with qemu-img's native transport — libvirt (host-side) and vSphere (pod-side) use the nfs:// libnfs driver; Proxmox kernel-mounts the export (its qemu-img ships no libnfs). NFS needs nfs.uid/gid set to the export owner; see examples/vmmigration-nfs.yaml.

Validated: all three providers in both directions over both backends — vSphere ⇄ Libvirt/KVM ⇄ Proxmox VE (ADR-0006 Slices 1–4). The Proxmox provider participates as a full source and target and advertises s3 and nfs (not PVC: its disks live on the node, which a pod-mounted PVC can never reach).

apiVersion: infra.virtrigaud.io/v1beta1
kind: VMMigration
metadata:
  name: vm-migration-example
  namespace: default
spec:
  source:
    vmRef:
      name: source-vm
  target:
    name: target-vm
    providerRef:
      name: target-provider
  storage:
    type: s3
    transferMode: relay        # relay (implemented); auto → relay
    s3:
      bucket: virtrigaud
      endpoint: http://minio.example:9000   # omit for AWS S3
      region: us-east-1
      usePathStyle: true                     # true for MinIO/Ceph/rustfs; false for AWS
      credentialsSecretRef:
        name: s3-migration-credentials       # keys: accessKeyID, secretAccessKey

A legacy storage.type: pvc model (ReadWriteMany StorageClass) remains for the vSphere/libvirt directions but is compat-only — it does not work for Proxmox or for host-resident libvirt disks. See examples/migration/ for per-direction examples.

For full migration documentation including provider restart behaviour, see the Migration Guide.

Observability

The manager exposes Prometheus metrics at :8080/metrics (HTTP by default; flip --metrics-secure=true for HTTPS).

11 of 12 virtrigaud_* metric families are active. virtrigaud_queue_depth was deprecated in v0.3.6 (use workqueue_depth{name} instead); removal scheduled for v0.4.0.

For the full metric catalog see Observability.

Troubleshooting

Missing CRDs after Helm install

# Check if CRDs were skipped
helm get values virtrigaud -n virtrigaud-system | grep skip-crds

# Manually install CRDs
kubectl apply -f charts/virtrigaud/crds/

# Or reinstall
helm uninstall virtrigaud -n virtrigaud-system
helm install virtrigaud virtrigaud/virtrigaud --version 0.3.11 \
  -n virtrigaud-system --create-namespace

Development

Building

make build          # Build the manager binary (requires Go 1.26+)
make docker-build   # Build container image
make test           # Run unit tests
make generate manifests  # Regenerate CRDs and DeepCopy

Local Testing

# Quick lint check (before every commit)
./hack/test-lint-locally.sh

# Comprehensive CI testing (before PRs)
./hack/test-ci-locally.sh

# Test Helm charts with Kind cluster
./hack/test-helm-locally.sh

See Testing Workflows Locally for detailed instructions.

Documentation

Primary documentation: https://projectbeskar.github.io/virtrigaud

In-tree design decisions: docs/adr/

Contributing

Contributions are welcome. See CONTRIBUTING.md.

Authors

  • William Rizzo (@wrkode) — project maintainer
  • Erick Bourgeois (@ebourgeois) — project maintainer

License

Apache License 2.0 — see LICENSE.

About

A Kubernetes operator for managing virtual machines across multiple hypervisors.

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages