High Availability#
Status: roadmap. PSW can form Proxmox clusters today (psw node cluster), but the pieces below — ha: true on targets, shared_storage: in network.yml, psw maintenance start/end, automatic failover wiring — are not yet implemented. This page captures the intended design so contributors and early users have one north star to build toward.
What Is It?#
High availability (often shortened to HA) means your apps keep running even when a server fails or needs to go offline for maintenance. If one of your Proxmox nodes goes down — whether from a dead power supply, a crashed disk, or because you’re installing updates — your targets automatically restart on another node. Your family keeps streaming, your passwords stay accessible, your smart home keeps working.
Think of it like a relay race: if one runner stumbles, the next runner picks up the baton and keeps going. Your targets are the baton — they always keep moving, even when the runner underneath them changes.
Why Does It Matter?#
Hardware Failures#
Servers break. Power supplies die, RAM sticks fail, disks corrupt. Without HA, a single dead server takes down every app running on it — and they stay down until you notice, diagnose, and manually fix things. That could be hours, or days if you’re away from home.
With HA, Proxmox detects the failure within seconds and automatically restarts your targets on a healthy node — typically back online within a couple of minutes, without you lifting a finger.
Planned Maintenance#
Even healthy servers need maintenance: firmware updates, OS upgrades, hardware swaps, adding more RAM. Without HA, maintenance means downtime — you shut everything down, do the work, and hope nothing goes wrong when you start things back up.
With HA, PSW live-migrates your targets to another node before you touch anything. Your apps keep running the entire time, on a different node. When you’re done, they move back. Zero downtime for your family.
The Three Ingredients#
High availability needs three things working together. Remove any one and HA can’t function.
┌──────────────────────────────────────────────────────┐
│ 1. Multiple Nodes                                    │
│    Two or more Proxmox servers that work as a team   │
├──────────────────────────────────────────────────────┤
│ 2. Shared Storage (NFS)                              │
│    A central place for target data, reachable from   │
│    every node                                        │
├──────────────────────────────────────────────────────┤
│ 3. HA Manager                                        │
│    Proxmox's built-in watchdog that monitors nodes   │
│    and moves targets when something goes wrong       │
└──────────────────────────────────────────────────────┘
Multiple Nodes (Proxmox Cluster)#
A cluster is two or more Proxmox nodes that are aware of each other and coordinate as a team. They continuously exchange heartbeat signals — small network messages that say “I’m alive.” When heartbeats stop arriving from a node, the cluster knows something is wrong.
PSW forms the cluster automatically when you define multiple nodes in your user project. You set up each physical server with psw-proxmox-installer as usual, then declare them together:
# network.yml
management:
  gateway: 10.10.0.1
  hosts:
    node1:
      ip: 10.10.0.198
      roles: [proxmox]
    node2:
      ip: 10.10.0.199
      roles: [proxmox]
cluster:
  nodes: [node1, node2]
That’s enough for PSW to join them into a Proxmox cluster during bootstrap. You can add more nodes later — PSW handles joining them to the existing cluster.
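Under the hood, cluster formation maps onto Proxmox’s standard tooling. As a hedged sketch of the roughly equivalent manual steps for the two hosts above (the cluster name psw-cluster is illustrative, not something PSW defines today):
# On node1: create the cluster (name is illustrative)
pvecm create psw-cluster
# On node2: join the existing cluster by pointing at node1's management IP
pvecm add 10.10.0.198
# On any node: verify membership and quorum
pvecm status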
Shared Storage (NFS)#
NFS (Network File System) is a protocol that lets multiple computers access the same files over the network, as if they were on a local disk. Think of it like a shared filing cabinet in the middle of an office — anyone at any desk can open the same drawer.
Without shared storage, a target’s data lives on one node’s local disk. If that node dies, the data is trapped on it — another node can’t start the target because it can’t reach the disk. Shared storage solves this by putting target data on a network device that every node can access:
Node 1 ──┐
         ├──► NFS Server ◄──── Target data lives here
Node 2 ──┘
Now any node can start any target, because the data isn’t tied to a specific server.
The NFS server can be one of your own servers with enough storage, or a NAS (Network Attached Storage — a dedicated device for storing files, like a Synology, QNAP, or TrueNAS box). PSW configures every node to mount the shared storage automatically:
# network.yml (continued)
cluster:
  nodes: [node1, node2]
  shared_storage:
    type: nfs
    server: 10.10.0.50     # Your NAS IP
    export: /mnt/pool/psw  # The shared folder on the NAS
    mount: /mnt/shared     # Where nodes mount it locally
During bootstrap, PSW mounts the NFS share on every node and configures Proxmox to use it as a storage backend. From that point on, all managed targets store their data on the shared storage by default.
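In Proxmox terms this amounts to registering the share as a cluster-wide storage backend. A minimal sketch using Proxmox’s own CLI, assuming the values above (the storage ID shared and the content types are illustrative choices, not something PSW defines yet):
# Register the NFS export as shared storage for container and VM disks
pvesm add nfs shared --server 10.10.0.50 --export /mnt/pool/psw --content images,rootdir
# Confirm every node reports the storage as active
pvesm status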
HA Manager#
The HA Manager is Proxmox’s built-in system for monitoring targets and restarting them on healthy nodes when something goes wrong. PSW configures it for you — you just declare which targets should be protected.
When a node disappears from the cluster (heartbeats stop), the HA Manager:
- Fences the failed node — makes sure it’s truly isolated and can’t still be running targets. This prevents split-brain (a dangerous situation where two nodes both try to run the same target, which would corrupt data)
- Restarts the HA-protected targets on a surviving node
- Reports what happened, so convergence and the dashboard know
How PSW Configures It#
Enabling HA on Targets#
Each target can opt into high availability. The node field becomes the preferred node — where the target runs under normal conditions — rather than the only possible node:
# network.yml
targets:
  core:
    type: lxc
    node: node1   # Preferred node
    ha: true      # Protected by HA
    cores: 8
    memory: 40960
    disk: 200
  media:
    type: lxc
    node: node1
    ha: true
    cores: 4
    memory: 4096
    disk: 100
  monitoring:
    type: lxc
    node: node2   # Spread load across nodes
    ha: true
    cores: 4
    memory: 4096
    disk: 50
Targets without ha: true stay pinned to their node. If that node goes down, those targets stay down until the node recovers. This is fine for non-critical workloads, but anything your family depends on should have HA enabled.
Bare targets (like a VPS running Pangolin) are not part of the Proxmox cluster and don’t participate in HA — they’re managed externally.
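Behind the scenes, ha: true would translate into a Proxmox HA resource definition for that target. A minimal sketch, assuming the core container received VMID 101 (the ID and option values are illustrative):
# Keep container 101 running, with limited restart and relocate attempts
ha-manager add ct:101 --state started --max_restart 2 --max_relocate 1
# Inspect what the HA stack is tracking and where each resource runs
ha-manager status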
What Bootstrap Does#
When your project defines a cluster, bootstrap handles the extra setup automatically:
- Forms the cluster — joins all defined nodes into a Proxmox cluster
- Mounts shared storage — configures NFS on every node
- Registers HA resources — tells the Proxmox HA Manager which targets to protect
- Deploys core apps — as usual, but now on shared storage with HA protection
After bootstrap, the HA Manager runs continuously on the cluster, watching every protected target.
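You can sanity-check each layer afterwards with Proxmox’s own tools on any node; a quick sketch:
pvecm status        # cluster membership and quorum
pvesm status        # shared storage mounted and active
ha-manager status   # HA resources registered and running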
What Happens When a Node Fails#
Here’s the step-by-step sequence when a node unexpectedly goes offline:
1. Node 1 loses power
   └── Heartbeats stop reaching Node 2
2. Cluster detects the failure (within seconds)
   └── HA Manager confirms Node 1 is unreachable
3. Fencing
   └── Cluster verifies Node 1 is truly isolated
   └── Prevents split-brain (two nodes running the same target)
4. Restart on healthy node
   └── HA Manager starts Node 1's protected targets on Node 2
   └── Targets boot from shared storage — same data, different node
5. Targets come online (typically 1-2 minutes total)
   └── Apps start serving requests again
6. Convergence reconciles
   └── Detects targets moved to Node 2
   └── Updates DNS records if IPs changed (via providers)
   └── Runs wiring to verify app connections
   └── Dashboard reflects the new state
Your apps experience a brief interruption (the time it takes to restart), but everything comes back automatically. No SSH, no manual intervention, no panic.
When the Node Comes Back#
When the failed node recovers and rejoins the cluster, targets don’t automatically move back. This is intentional — they’re already running fine on the healthy node, and moving them would cause unnecessary downtime. Targets return to their preferred node during the next planned maintenance window, or you can trigger it manually.
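Triggering it manually would map onto an HA-aware migration request. A hedged sketch, assuming the core container is ct:101 (VMID illustrative):
# Ask the HA Manager to move the resource back to its preferred node
ha-manager migrate ct:101 node1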
Planned Maintenance#
Maintenance is where HA truly shines. Instead of an outage, you get a controlled, zero-downtime migration.
Starting Maintenance#
psw -C ~/my-project maintenance start node1
This tells PSW to drain the node — live-migrate all its targets to other nodes in the cluster:
Before:                       After drain:
┌─────────┐  ┌─────────┐      ┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │      │ Node 1  │  │ Node 2  │
│ core    │  │ monitor │  →   │ (empty) │  │ monitor │
│ media   │  │         │      │         │  │ core    │
│         │  │         │      │         │  │ media   │
└─────────┘  └─────────┘      └─────────┘  └─────────┘
Live migration means the target keeps running while its state transfers to the new node. Apps experience at most a brief pause (typically under a second) — no restart, no lost connections. (For VM targets this is a true live migration; LXC targets are currently moved with Proxmox’s restart-mode migration, which involves a brief restart.) This is possible because shared storage means the data is already accessible from both nodes; only the running state needs to move.
Once the drain completes, the node is safe to shut down, reboot, or open up for hardware work.
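A drain like this would most plausibly lean on the HA Manager’s migration commands (recent Proxmox releases also ship a node-level maintenance mode in the HA stack that it could build on). A minimal sketch, assuming core and media received VMIDs 101 and 102 (illustrative):
# Move each HA resource off node1 before taking the node down
ha-manager migrate ct:101 node2   # core
ha-manager migrate ct:102 node2   # media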
Ending Maintenance#
psw -C ~/my-project maintenance end node1
PSW migrates targets back to their preferred nodes, rebalancing the cluster:
After maintenance:
┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │
│ core    │  │ monitor │
│ media   │  │         │
└─────────┘  └─────────┘
Everything returns to its intended layout, and convergence reconciles any state that shifted.
What About the Bootstrap Target?#
The bootstrap target is special — it runs the core apps and the convergence engine. With HA, it’s protected like any other target. If the node hosting it fails, the bootstrap target restarts on another node, and convergence resumes on the next timer tick.
This means your entire automation pipeline — including the engine that keeps everything in sync — survives a node failure.
Capacity Planning#
When one node is down (for failure or maintenance), the surviving nodes must handle all the targets. PSW validates this during project graph construction:
- If the combined resource requirements of all HA targets exceed what the surviving nodes can provide, PSW raises a warning during psw project validate
- This check ensures you don’t discover capacity problems during an actual failure
A good rule of thumb: each node should have enough spare capacity to absorb its neighbor’s workload. With two identical nodes, that means each runs at roughly 50% capacity under normal conditions.
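For example, with the three HA targets above (8 + 4 + 4 cores and 40960 + 4096 + 4096 MB of memory), whichever node survives must be able to host all three at once: at least 16 cores and roughly 48 GB of RAM for the targets alone, before counting the node’s own overhead.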
How It Connects to Everything Else#
| System | How HA Interacts With It |
|---|---|
| Convergence | Detects when targets have moved between nodes and reconciles DNS, wiring, and conventions |
| Providers | Updates DNS records if a target’s IP changes after migration |
| Infrastructure | Extends the node/target model with cluster awareness and shared storage |
| Bootstrap | Forms the cluster, mounts shared storage, registers HA resources |
| Project Graph | Validates capacity — ensures surviving nodes can absorb failed node’s targets |
| Idempotency | Convergence safely reconciles after failover without creating duplicates |
Key Ideas#
- Automatic failover — targets restart on healthy nodes within minutes, no manual intervention
- Zero-downtime maintenance — live migration moves running targets between nodes without interruption
- Shared storage is the enabler — NFS lets any node access any target’s data, making migration and failover possible
- Declarative — you define ha: true on a target; PSW handles cluster formation, storage, and HA registration
- Capacity-aware — PSW validates that surviving nodes can handle the extra load before you’re in a crisis
- Convergence-integrated — after any failover or migration, convergence reconciles DNS, wiring, and conventions automatically
- Fencing prevents corruption — the cluster verifies a failed node is truly isolated before restarting its targets elsewhere