Convergence#

What Is It?#

Convergence is PSW’s automatic deployment engine. It runs on your bootstrap target (the server where core apps live) and continuously makes sure your self-hosted solution matches what’s defined in your user project files.

Think of it like a diligent assistant: every 5 minutes, it checks “did anything change in the project?” and if so, it applies those changes — creating targets, deploying apps, wiring them together, or tearing down things you’ve removed.

Why Does It Exist?#

Without convergence, you’d have to manually SSH into your server and run deployment commands every time you add or change an app. With convergence, you just:

  1. Edit your project files (add an app, change a setting)
  2. git commit && git push
  3. Walk away — convergence handles the rest

This is called GitOps — using a git repository as the single source of truth, with the system automatically converging toward the desired state.

How Does It Get Triggered?#

Convergence can start in three ways:

| Trigger | How | When |
| --- | --- | --- |
| Timer | Systemd timer (Linux service scheduler) fires every 5 minutes | Automatic, always running |
| Git push | Forgejo webhook notifies the dashboard | Immediate after you push changes |
| Manual | psw deploy converge from your workstation | On demand |

The timer is the primary mechanism. It runs as a systemd user service on the bootstrap target, so it survives reboots and keeps running even when you’re not around.

What Happens During Convergence?#

Convergence runs through six phases, always in order. If something crashes, it picks up from where it left off on the next run.

Phase 1: Reconcile#

“Does my state file match reality?”

The convergence engine keeps a state file (.psw/converge/state.yml) tracking what it knows about your targets — their IPs, VMIDs, and whether they’ve been prepared.

Sometimes reality drifts from this state (a crash lost the file, a target was manually removed, etc.). This phase queries each Proxmox node to see what actually exists and fixes any discrepancies.

Phase 2: Infrastructure#

“Do the right targets exist?”

Compares what targets are defined (in network.yml → targets) with what actually exists on Proxmox:

  • Missing targets: Creates a new managed target on the right Proxmox node and starts it with the static IP planned in network.yml
  • Plan-edited IPs: When you change a target’s ip in network.yml, the target is reconfigured in place (pct set --net0 + reboot) on the next tick — the plan stays the source of truth
  • Removed targets: Destroys targets that are no longer defined
  • Unreachable targets: Tracks how long they’ve been down. After enough retries, tries to rediscover the IP from Proxmox so local state can heal; the network.yml plan is never overwritten
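
As a rough illustration, a targets entry in network.yml might look like the sketch below. The field names are assumptions based on what this page describes (a planned static IP and a Proxmox node per target), not the exact PSW schema.

```yaml
# Hypothetical network.yml fragment; field names are illustrative, not the real schema.
targets:
  monitoring:
    node: pve1          # Proxmox node that hosts this managed target
    ip: 192.168.1.60    # planned static IP; editing it reconfigures the target in place
  media:
    node: pve1
    ip: 192.168.1.61
```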

Phase 3: Prepare Targets#

“Are the targets ready for apps?”

New targets need some one-time setup before apps can be deployed: installing Podman (the app runtime), creating the sysops user, setting up SSH keys. This phase runs the preparation role on any target that hasn’t been prepared yet.

Phase 4: Teardown#

“Are there apps that need to be removed?”

When you run psw app remove jellyfin, it sets state: absent in the service manifest. This phase finds those markers and tears down the apps — stopping them, removing data (unless you chose to keep it), and cleaning up.
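
For intuition, the marker might look like the sketch below in services/jellyfin/service.yml. Only state: absent is taken from this page; the surrounding fields are assumed for illustration.

```yaml
# services/jellyfin/service.yml after psw app remove jellyfin (illustrative sketch).
name: jellyfin
target: media        # assumed field, shown for context
state: absent        # the marker the teardown phase looks for
```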

Phase 5: Deploy#

“Deploy the apps.”

This is the main event, and it doesn’t run as a fixed sequence of sub-phases — it runs as an execution plan, a dependency graph where unrelated work happens in parallel and dependent work waits automatically. Conceptually:

  1. Renders all deployment files (container definitions, environment configs, app-specific templates) from your service definitions and secrets. At this step the agent also rebuilds the aggregator output directories (services/traefik/dynamic/, services/prometheus/scrape-configs/, services/backrest/backup-plans/) from the per-app convention fragments — those dirs are .gitignored because they’re derived; every agent run reconstructs them so git stays focused on what humans author.
  2. Builds an execution plan — one node per piece of work (deploy this app, sync this aggregator, run these reconcilers, update DNS), wired together with “wait for X” edges derived from meta.yml. If only specific apps changed, the plan is pruned to the integration neighborhood so unaffected work is skipped.
  3. Walks the plan — deploys each app (resolving storage, creating databases, copying files, starting containers, checking health), runs its setup reconcilers inline before sidecars start, syncs convention files for each aggregator (Traefik routes, Authelia SSO, Prometheus scrape targets, Homepage widgets), runs each app’s integration reconcilers (Sonarr registering in Prowlarr, Backrest hooking into ntfy…), and reconciles DNS records via your providers.

Because everything is one DAG, a failure in one app only blocks nodes that explicitly depended on it — unrelated apps keep deploying. See execution plan for the full story.
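
The dependency edges come from app metadata, so a hypothetical meta.yml fragment could look like the sketch below. The key names are assumptions made for illustration; only the idea that meta.yml declares which aggregators and apps a deployment must wait for comes from this page.

```yaml
# Hypothetical roles/grafana/meta.yml fragment; key names are assumed, not the real schema.
integrations:
  traefik: true        # publish a route once the app is healthy
  authelia: true       # protect that route with SSO
  prometheus: true     # register a scrape target
depends_on:
  - prometheus         # adds a "wait for prometheus" edge to the execution plan
```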

Phase 6: Finalize#

“Record what happened.”

Updates the convergence state with the results: which commit was deployed, which targets succeeded or failed, and when it all happened. This information powers the dashboard and helps the next run know where to pick up.

Drift Detection: How Does It Know What Changed?#

When convergence runs, it compares the current git HEAD with the last successfully deployed commit using git diff. It categorizes every changed file:

| Changed File | What It Means |
| --- | --- |
| services/jellyfin/service.yml | Jellyfin was added, removed, or modified |
| network.yml | Infrastructure changed — deploy everything |
| secrets/apps.yml | Secrets changed — deploy everything |
| roles/jellyfin/* | App recipe changed — deploy everything |

When only specific services changed, convergence is smart enough to deploy only the affected targets. When something broad changed (network, secrets, roles), it deploys everything to be safe.
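
As a conceptual example (not actual PSW output), a run where only two service manifests changed might be scoped like this:

```yaml
# Conceptual sketch of drift scoping, not a real PSW data structure.
changed_files:
  - services/jellyfin/service.yml   # app-scoped change
  - services/grafana/service.yml    # app-scoped change
scope: affected_targets             # nothing broad (network.yml, secrets, roles) changed
affected_targets:
  - media                           # hypothetical target hosting jellyfin
  - monitoring                      # hypothetical target hosting grafana
```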

Self-Loop Prevention#

Convergence itself makes commits (e.g., updating convention files). To avoid an infinite loop where convergence triggers itself, it checks whether every commit since the last run was authored by convergence and, if so, skips the run.

Safety Mechanisms#

Convergence runs automatically, so it needs guardrails:

Lock#

Only one convergence can run at a time. A lock file (.psw/converge/state.lock) prevents concurrent runs. If a previous run crashed and left a stale lock, convergence detects the dead process and cleans up.

Failure Throttle#

If convergence fails 3 times in a row, it starts throttling — skipping 2 out of every 3 timer fires. This prevents a broken config from hammering the system every 5 minutes.

System Load Check#

Before starting, convergence checks the system’s CPU load. If the server is already under heavy load, it skips this run. No point making things worse.

Teardown Retry Cap#

If tearing down an app fails 3 times in a row, convergence stops trying and shows a warning. This prevents a single broken teardown from blocking all other work.

Resource Limits#

The convergence systemd service has hard limits: 2 GB memory, 80% CPU, 1 hour timeout. This protects the system from runaway deployments.

The State File#

Convergence keeps its memory in .psw/converge/state.yml — a gitignored file on the bootstrap target. It tracks:

  • Last deployed commit — so it knows what’s new
  • Target states — IP, VMID, prepared status, deploy status for each target
  • Failed teardowns — apps that couldn’t be removed (with retry counts)
  • Integration state — per-aggregator sync status
  • Run history — the last 50 runs (used by the dashboard)

This file is never committed to git — it’s runtime state that only matters on the server.
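
To make that concrete, the state file might be shaped roughly like the sketch below. Key names are assumptions derived from the bullets above, not the exact on-disk format.

```yaml
# Hypothetical shape of .psw/converge/state.yml; key names are illustrative.
last_deployed_commit: 3f9c2a1
targets:
  monitoring:
    ip: 192.168.1.60
    vmid: 112
    prepared: true
    deploy_status: success
failed_teardowns:
  jellyfin:
    retries: 2              # convergence gives up after 3 consecutive failures
integrations:
  traefik: synced
  prometheus: synced
runs:                       # last 50 runs, truncated here
  - started_at: "2024-06-01T12:05:02Z"
    result: success
```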

The Convergence Report#

After every run, convergence produces a structured report showing:

  • Overall result: Success, Degraded (apps deployed but integration failed), Partial (some targets failed), or Failed
  • Per-phase breakdown with timing
  • Per-target and per-app outcomes for the deploy phase
  • Action items if something needs attention
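
A report might be shaped roughly like the sketch below; the field names are assumptions based on the bullets above, not the exact output format.

```yaml
# Illustrative convergence report; field names are assumed, not the real format.
result: degraded                # apps deployed, but an integration step failed
phases:
  reconcile: { status: ok, duration_s: 4 }
  infrastructure: { status: ok, duration_s: 12 }
  prepare: { status: ok, duration_s: 0 }
  teardown: { status: ok, duration_s: 0 }
  deploy: { status: degraded, duration_s: 310 }
  finalize: { status: ok, duration_s: 1 }
deploy:
  monitoring:                   # per-target, per-app outcomes
    grafana: success
action_items:
  - "Authelia SSO sync for grafana failed; check the config and re-run"
```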

How Is Convergence Installed?#

During bootstrap, PSW:

  1. Copies the convergence source code to the bootstrap target
  2. Creates a Python virtual environment with all dependencies
  3. Clones your user project repo from Forgejo
  4. Installs systemd timer + service units under the sysops user
  5. Seeds the convergence state with the current commit (so it doesn’t re-deploy everything on first run)
  6. Enables and starts the timer

From that point on, convergence runs autonomously.

The Lifecycle#

Here’s the full picture of how a change flows through the system:

You: psw app add grafana --target monitoring  (see [apps](apps.md#app-lifecycle))
     git commit && git push
          │
          ▼
Forgejo receives the push
          │
          ▼
Timer fires (or webhook triggers)
          │
          ▼
Convergence starts
  ├── Phase 1: Reconcile (state matches reality? ✓)
  ├── Phase 2: Infrastructure (monitoring target exists? create if not)
  ├── Phase 3: Prepare (monitoring target has podman + sysops? ✓)
  ├── Phase 4: Teardown (any apps marked absent? no)
  ├── Phase 5: Deploy
  │   ├── Render templates and resolve storage
  │   ├── Deploy [Grafana](https://github.com/grafana/grafana) to monitoring target
  │   ├── Readiness check passes
  │   └── Integration: Traefik gets route, Authelia gets SSO config (see [conventions](conventions.md))
  └── Phase 6: Finalize (record success, update state)
          │
          ▼
grafana.yourdomain.ca is live

Key Concepts#

  • Pull-based: The server pulls changes from git, rather than you pushing commands to the server
  • Idempotent: Running convergence twice without changes results in zero changes
  • Self-healing: If state is lost or targets disappear, reconciliation fixes it
  • Phased: Each phase completes fully before the next begins, with crash recovery
  • Throttled: Failures don’t cause infinite retry loops
  • Observable: Every run produces a structured report visible in the dashboard