High Availability#
Status: roadmap. PSW can form Proxmox clusters today (psw node cluster), but the pieces below — ha: true on targets, shared_storage: in network.yml, psw maintenance start/end, automatic failover wiring — are not yet implemented. This page captures the intended design so contributors and early users have one north star to build toward.
What Is It?#
High availability (often shortened to HA) means your apps keep running even when a server fails or needs to go offline for maintenance. If one of your Proxmox nodes goes down — whether from a dead power supply, a crashed disk, or because you’re installing updates — your targets automatically restart on another node. Your family keeps streaming, your passwords stay accessible, your smart home keeps working.
Think of it like a relay race: if one runner stumbles, the next runner picks up the baton and keeps going. Your targets are the baton — they always keep moving, even when the runner underneath them changes.
Why Does It Matter?#
Hardware Failures#
Servers break. Power supplies die, RAM sticks fail, disks corrupt. Without HA, a single dead server takes down every app running on it — and they stay down until you notice, diagnose, and manually fix things. That could be hours, or days if you’re away from home.
With HA, Proxmox detects the failure within seconds and automatically restarts your targets on a healthy node — typically back online within a couple of minutes, without you lifting a finger.
Planned Maintenance#
Even healthy servers need maintenance: firmware updates, OS upgrades, hardware swaps, adding more RAM. Without HA, maintenance means downtime — you shut everything down, do the work, and hope nothing goes wrong when you start things back up.
With HA, PSW live-migrates your targets to another node before you touch anything. Your apps keep running the entire time, on a different node. When you’re done, they move back. Zero downtime for your family.
The Three Ingredients#
High availability needs three things working together. Remove any one and HA can’t function.
┌──────────────────────────────────────────────────────┐
│ 1. Multiple Nodes                                    │
│    Two or more Proxmox servers that work as a team   │
├──────────────────────────────────────────────────────┤
│ 2. Shared Storage (NFS)                              │
│    A central place for target data, reachable from   │
│    every node                                        │
├──────────────────────────────────────────────────────┤
│ 3. HA Manager                                        │
│    Proxmox's built-in watchdog that monitors nodes   │
│    and moves targets when something goes wrong       │
└──────────────────────────────────────────────────────┘
Multiple Nodes (Proxmox Cluster)#
A cluster is two or more Proxmox nodes that are aware of each other and coordinate as a team. They continuously exchange heartbeat signals — small network messages that say “I’m alive.” When heartbeats stop arriving from a node, the cluster knows something is wrong.
PSW forms the cluster automatically when you define multiple nodes in your user project. You set up each physical server with psw-proxmox-installer as usual, then declare them together:
# network.yml
management:
  gateway: 10.10.0.1
  hosts:
    node1:
      ip: 10.10.0.198
      roles: [proxmox]
    node2:
      ip: 10.10.0.199
      roles: [proxmox]
cluster:
  nodes: [node1, node2]
That’s enough for PSW to join them into a Proxmox cluster during bootstrap. You can add more nodes later — PSW handles joining them to the existing cluster.
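Under the hood, cluster formation maps onto Proxmox’s standard tooling. As a hedged sketch of the roughly equivalent manual steps for the two hosts above (the cluster name psw-cluster is illustrative, not something PSW defines today):
# On node1: create the cluster (name is illustrative)
pvecm create psw-cluster
# On node2: join the existing cluster by pointing at node1's management IP
pvecm add 10.10.0.198
# On any node: verify membership and quorum
pvecm status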
Shared Storage (NFS)#
NFS (Network File System) is a protocol that lets multiple computers access the same files over the network, as if they were on a local disk. Think of it like a shared filing cabinet in the middle of an office — anyone at any desk can open the same drawer.
Without shared storage, a target’s data lives on one node’s local disk. If that node dies, the data is trapped on it — another node can’t start the target because it can’t reach the disk. Shared storage solves this by putting target data on a network device that every node can access:
Node 1 ──┐
         ├──► NFS Server ◄──── Target data lives here
Node 2 ──┘
Now any node can start any target, because the data isn’t tied to a specific server.
The NFS server can be one of your own servers with enough storage, or a NAS (Network Attached Storage — a dedicated device for storing files, like a Synology, QNAP, or TrueNAS box). PSW configures every node to mount the shared storage automatically:
# network.yml (continued)
cluster:
  nodes: [node1, node2]
  shared_storage:
    type: nfs
    server: 10.10.0.50     # Your NAS IP
    export: /mnt/pool/psw  # The shared folder on the NAS
    mount: /mnt/shared     # Where nodes mount it locally
During bootstrap, PSW mounts the NFS share on every node and configures Proxmox to use it as a storage backend. From that point on, all managed targets store their data on the shared storage by default.
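In Proxmox terms this amounts to registering the share as a cluster-wide storage backend. A minimal sketch using Proxmox’s own CLI, assuming the values above (the storage ID shared and the content types are illustrative choices, not something PSW defines yet):
# Register the NFS export as shared storage for container and VM disks
pvesm add nfs shared --server 10.10.0.50 --export /mnt/pool/psw --content images,rootdir
# Confirm every node reports the storage as active
pvesm status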
HA Manager#
The HA Manager is Proxmox’s built-in system for monitoring targets and restarting them on healthy nodes when something goes wrong. PSW configures it for you — you just declare which targets should be protected.
When a node disappears from the cluster (heartbeats stop), the HA Manager:
- Fences the failed node — makes sure it’s truly isolated and can’t still be running targets. This prevents split-brain (a dangerous situation where two nodes both try to run the same target, which would corrupt data)
- Restarts the HA-protected targets on a surviving node
- Reports what happened, so convergence and the dashboard know
How PSW Configures It#
Enabling HA on Targets#
Each target can opt into high availability. The node field becomes the preferred node — where the target runs under normal conditions — rather than the only possible node:
# network.yml
targets:
  core:
    type: lxc
    node: node1   # Preferred node
    ha: true      # Protected by HA
    cores: 8
    memory: 40960
    disk: 200
  media:
    type: lxc
    node: node1
    ha: true
    cores: 4
    memory: 4096
    disk: 100
  monitoring:
    type: lxc
    node: node2   # Spread load across nodes
    ha: true
    cores: 4
    memory: 4096
    disk: 50
Targets without ha: true stay pinned to their node. If that node goes down, those targets stay down until the node recovers. This is fine for non-critical workloads, but anything your family depends on should have HA enabled.
Bare targets (like a VPS running Pangolin) are not part of the Proxmox cluster and don’t participate in HA — they’re managed externally.
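Behind the scenes, ha: true would translate into a Proxmox HA resource definition for that target. A minimal sketch, assuming the core container received VMID 101 (the ID and option values are illustrative):
# Keep container 101 running, with limited restart and relocate attempts
ha-manager add ct:101 --state started --max_restart 2 --max_relocate 1
# Inspect what the HA stack is tracking and where each resource runs
ha-manager status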
What Bootstrap Does#
When your project defines a cluster, bootstrap handles the extra setup automatically:
- Forms the cluster — joins all defined nodes into a Proxmox cluster
- Mounts shared storage — configures NFS on every node
- Registers HA resources — tells the Proxmox HA Manager which targets to protect
- Deploys core apps — as usual, but now on shared storage with HA protection
After bootstrap, the HA Manager runs continuously on the cluster, watching every protected target.
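You can sanity-check each layer afterwards with Proxmox’s own tools on any node; a quick sketch:
pvecm status        # cluster membership and quorum
pvesm status        # shared storage mounted and active
ha-manager status   # HA resources registered and running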
What Happens When a Node Fails#
Here’s the step-by-step sequence when a node unexpectedly goes offline:
1. Node 1 loses power
   └── Heartbeats stop reaching Node 2
2. Cluster detects the failure (within seconds)
   └── HA Manager confirms Node 1 is unreachable
3. Fencing
   └── Cluster verifies Node 1 is truly isolated
   └── Prevents split-brain (two nodes running the same target)
4. Restart on healthy node
   └── HA Manager starts Node 1's protected targets on Node 2
   └── Targets boot from shared storage — same data, different node
5. Targets come online (typically 1-2 minutes total)
   └── Apps start serving requests again
6. Convergence reconciles
   └── Detects targets moved to Node 2
   └── Updates DNS records if IPs changed (via providers)
   └── Runs wiring to verify app connections
   └── Dashboard reflects the new state
Your apps experience a brief interruption (the time it takes to restart), but everything comes back automatically. No SSH, no manual intervention, no panic.
When the Node Comes Back#
When the failed node recovers and rejoins the cluster, targets don’t automatically move back. This is intentional — they’re already running fine on the healthy node, and moving them would cause unnecessary downtime. Targets return to their preferred node during the next planned maintenance window, or you can trigger it manually.
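Triggering it manually would map onto an HA-aware migration request. A hedged sketch, assuming the core container is ct:101 (VMID illustrative):
# Ask the HA Manager to move the resource back to its preferred node
ha-manager migrate ct:101 node1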
Planned Maintenance#
Maintenance is where HA truly shines. Instead of an outage, you get a controlled, zero-downtime migration.
Starting Maintenance#
psw -C ~/my-project maintenance start node1
This tells PSW to drain the node — live-migrate all its targets to other nodes in the cluster:
Before:                       After drain:
┌─────────┐  ┌─────────┐      ┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │      │ Node 1  │  │ Node 2  │
│ core    │  │ monitor │  →   │ (empty) │  │ monitor │
│ media   │  │         │      │         │  │ core    │
│         │  │         │      │         │  │ media   │
└─────────┘  └─────────┘      └─────────┘  └─────────┘
Live migration means the target keeps running while its state transfers to the new node. Apps experience at most a brief pause (typically under a second) — no restart, no lost connections. (For VM targets this is a true live migration; LXC targets are currently moved with Proxmox’s restart-mode migration, which involves a brief restart.) This is possible because shared storage means the data is already accessible from both nodes; only the running state needs to move.
Once the drain completes, the node is safe to shut down, reboot, or open up for hardware work.
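A drain like this would most plausibly lean on the HA Manager’s migration commands (recent Proxmox releases also ship a node-level maintenance mode in the HA stack that it could build on). A minimal sketch, assuming core and media received VMIDs 101 and 102 (illustrative):
# Move each HA resource off node1 before taking the node down
ha-manager migrate ct:101 node2   # core
ha-manager migrate ct:102 node2   # media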
Ending Maintenance#
psw -C ~/my-project maintenance end node1
PSW migrates targets back to their preferred nodes, rebalancing the cluster:
After maintenance:
┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │
│ core    │  │ monitor │
│ media   │  │         │
└─────────┘  └─────────┘
Everything returns to its intended layout, and convergence reconciles any state that shifted.
What About the Bootstrap Target?#
The bootstrap target is special — it runs the core apps and the convergence engine. With HA, it’s protected like any other target. If the node hosting it fails, the bootstrap target restarts on another node, and convergence resumes on the next timer tick.
This means your entire automation pipeline — including the engine that keeps everything in sync — survives a node failure.
Capacity Planning#
When one node is down (for failure or maintenance), the surviving nodes must handle all the targets. PSW validates this during project graph construction:
- If the combined resource requirements of all HA targets exceed what the surviving nodes can provide, PSW raises a warning during psw project validate
- This check ensures you don’t discover capacity problems during an actual failure
A good rule of thumb: each node should have enough spare capacity to absorb its neighbor’s workload. With two identical nodes, that means each runs at roughly 50% capacity under normal conditions.
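For example, with the three HA targets above (8 + 4 + 4 cores and 40960 + 4096 + 4096 MB of memory), whichever node survives must be able to host all three at once: at least 16 cores and roughly 48 GB of RAM for the targets alone, before counting the node’s own overhead.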
How It Connects to Everything Else#
| System | How HA Interacts With It |
|---|---|
| Convergence | Detects when targets have moved between nodes and reconciles DNS, wiring, and conventions |
| Providers | Updates DNS records if a target’s IP changes after migration |
| Infrastructure | Extends the node/target model with cluster awareness and shared storage |
| Bootstrap | Forms the cluster, mounts shared storage, registers HA resources |
| Project Graph | Validates capacity — ensures surviving nodes can absorb failed node’s targets |
| Idempotency | Convergence safely reconciles after failover without creating duplicates |
Key Ideas#
- Automatic failover — targets restart on healthy nodes within minutes, no manual intervention
- Zero-downtime maintenance — live migration moves running targets between nodes without interruption
- Shared storage is the enabler — NFS lets any node access any target’s data, making migration and failover possible
- Declarative — you define ha: true on a target; PSW handles cluster formation, storage, and HA registration
- Capacity-aware — PSW validates that surviving nodes can handle the extra load before you’re in a crisis
- Convergence-integrated — after any failover or migration, convergence reconciles DNS, wiring, and conventions automatically
- Fencing prevents corruption — the cluster verifies a failed node is truly isolated before restarting its targets elsewhere