# Monitoring

## What Is It?
Monitoring is how PSW answers the question “is everything still working?” without you having to SSH around and poke at services. It collects two kinds of signals from every target and every app:
- Metrics — numbers over time: CPU, memory, disk, response times, queue lengths, request counts
- Logs — the text each service writes as it runs: errors, warnings, requests served, things that crashed
Those signals get shipped to a central place, displayed on dashboards, and — when something goes wrong — turned into a phone notification so you actually find out about it.
## The Stack
Everything on this page ships together under one stack:

```shell
psw app add @observability --target monitoring
```

The stack (`psw-apps/stacks/observability.yml`) is exactly nine apps. Each one does one thing:
| App | Role in one sentence |
|---|---|
| Prometheus | The metrics database. Scrapes numbers from every app’s `/metrics` endpoint on a schedule and stores the time series |
| Loki | The logs database. Receives log lines from every target and indexes them by label (host, service, level) |
| Alloy | The log shipper. Deployed to every target (broadcast app), reads the systemd journal and forwards entries to Loki |
| Node Exporter | The hardware probe. Also broadcast; exposes CPU/RAM/disk/network metrics for Prometheus to scrape |
| Grafana | The dashboards. Plots metrics from Prometheus and logs from Loki side by side |
| Alertmanager | The alert router. When Prometheus trips an alert rule, Alertmanager groups, deduplicates, and forwards it |
| ntfy | The phone notifier. Alertmanager fires webhooks at it, ntfy pushes to your phone |
| Blackbox Exporter | The synthetic prober. Hits HTTP/TCP/DNS endpoints from outside the app and reports up/down |
| Uptime Kuma | The status page. A friendlier “are my services reachable?” dashboard at `https://status.<domain>` |
You don’t usually need all nine — the stack bundles them because they all fit together cleanly, but you can `psw app add` individual ones if you want just metrics or just logs.
## Metrics: How Prometheus Finds Every App
When you add a new app, Prometheus picks it up automatically — no manual scrape config to edit. The plumbing is a convention:
- An app sets `monitoring_enabled: true` in its `meta.yml` (most catalog apps already do)
- At deploy time the monitoring convention renderer generates a tiny scrape config file at `services/<app>/monitoring/<app>-scrape.yml` — host, port, metrics path
- Prometheus declares itself as the aggregator for the `monitoring` convention. Every render pass copies every per-app `*-scrape.yml` into `services/prometheus/scrape-configs/` (a `.gitignore`d derived directory — the per-app files are the source of truth). The execution plan’s SYNC node rsyncs that aggregated directory to the Prometheus container
- Prometheus is started with `--web.enable-lifecycle`, and its config loads `scrape_config_files: scrape-configs/*.yml`. The aggregator triggers a live reload, no restart
- Next scrape interval, Prometheus is already hitting the new app
Remove the app and the reverse happens: the scrape file disappears, the aggregator syncs, Prometheus stops scraping.
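For intuition, a generated per-app scrape file follows standard Prometheus scrape-config syntax. A sketch of the plausible shape — the real file is emitted by the convention renderer, and the job name, host, and port below are assumptions:

```yaml
# services/jellyfin/monitoring/jellyfin-scrape.yml (illustrative shape;
# the convention renderer produces the real file)
scrape_configs:
  - job_name: jellyfin
    metrics_path: /metrics
    static_configs:
      - targets: ["10.0.0.12:8096"]   # app host and metrics port (assumed)
```

Because Prometheus loads `scrape-configs/*.yml` via `scrape_config_files`, dropping a file like this into the aggregated directory is all it takes for the job to appear.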
### When an app doesn’t have `/metrics` of its own
Some apps (PostgreSQL, qBittorrent, Sonarr and the rest of the `*arr` family, Tdarr, Seerr) don’t speak Prometheus natively. For them, `meta.yml` declares an `exporter_image`:

```yaml
# sonarr/meta.yml
monitoring_enabled: true
exporter_image: ghcr.io/onedr0p/exportarr:v2.3.0
```

The deploy engine notices this and spins up a sidecar container next to the main app, sharing its network namespace. The sidecar queries the app’s native API and re-emits the numbers as Prometheus metrics. The scrape config points Prometheus at the sidecar’s port instead of the app’s. You never see it — it just works.
## Logs: How Every Log Line Reaches Loki
Logs ride a completely different path — push instead of pull:
- Alloy is a broadcast app. It’s deployed to every managed target automatically, no opt-in
- On each target, Alloy reads `/var/log/journal` (the systemd journal — every container’s stdout/stderr and every host service writes here)
- It relabels entries with useful tags (service name, unit, host) and forwards them to Loki via `loki.write` at `http://<loki_ip>:3100/loki/api/v1/push`
- Loki indexes by label and stores the content. In Grafana’s Explore view you can filter by host, service, level, or search the log text directly
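The pipeline above boils down to two Alloy components wired together. A minimal sketch in Alloy’s configuration language — the component names are standard Alloy, but PSW’s actual relabel rules are omitted and the Loki address is a placeholder:

```alloy
// Read the systemd journal and hand every entry to the Loki writer.
loki.source.journal "journal" {
  path       = "/var/log/journal"
  forward_to = [loki.write.default.receiver]
}

// Push entries to Loki's HTTP ingestion endpoint.
loki.write "default" {
  endpoint {
    url = "http://<loki_ip>:3100/loki/api/v1/push"
  }
}
```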
Node Exporter gets the same broadcast treatment — deployed everywhere, Prometheus scrapes every instance — so you see hardware metrics per target without configuring anything per target.
## Dashboards
Grafana ships with a curated set of dashboards pre-provisioned from its own `psw-apps/grafana/` directory. Currently bundled (exact list in `meta.yml` → `extra_files`):
- Alertmanager, Blackbox Exporter, Forgejo, Loki metrics, Node Exporter, PostgreSQL, qBittorrent, Radarr, Sonarr, Traefik
They’re loaded as read-only: you can clone or save-as to edit, but the originals stay pristine across upgrades. Two datasources are auto-provisioned alongside them — Prometheus and Loki — so dashboards and Explore work the moment Grafana finishes starting.
Unlike scrape configs, dashboards don’t aggregate from individual apps’ `meta.yml` — they’re hand-curated in the Grafana app itself.
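Datasource auto-provisioning uses Grafana’s standard provisioning format. A sketch of what PSW plausibly generates — the names are real Grafana provisioning fields, but the URLs and defaults here are assumptions:

```yaml
# grafana/provisioning/datasources/datasources.yml (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090   # assumed container address
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100         # assumed container address
```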
## Alerts: From “something broke” to Your Phone
Prometheus evaluates alert rules shipped in `psw-apps/prometheus/templates/alert-rules.yml.j2`. The defaults cover:
- Always — `InstanceDown`, `HighMemoryUsage`, `DiskSpaceLow`, `HighCpuUsage` (these all come from Node Exporter metrics)
- Conditional on what’s deployed — PostgreSQL rules if postgres is installed, Traefik rules if traefik is installed, Loki error rules if loki is installed
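The real rules live in the Jinja2 template above; for flavor, here is a hand-written `DiskSpaceLow` in standard Prometheus rule syntax. The threshold and duration are illustrative, not PSW’s actual values:

```yaml
groups:
  - name: node
    rules:
      - alert: DiskSpaceLow
        # Fires when a filesystem has been under 10% free for 15 minutes
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
```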
When a rule fires, Prometheus sends the alert to Alertmanager. Alertmanager’s config (`psw-apps/alertmanager/templates/alertmanager.yml.j2`) auto-configures an ntfy webhook receiver the moment ntfy is deployed — alerts land at the `psw-alerts` topic. Default alerts repeat every 4 hours; critical ones every hour.
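Those repeat intervals map onto Alertmanager’s routing tree. A minimal sketch of the generated config’s likely shape — the receiver name and ntfy URL are assumptions; the real source is `alertmanager.yml.j2`:

```yaml
route:
  receiver: ntfy
  repeat_interval: 4h        # default alerts re-notify every 4 hours
  routes:
    - matchers:
        - severity = "critical"
      repeat_interval: 1h    # critical alerts re-notify every hour
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://ntfy/psw-alerts   # assumed ntfy topic endpoint
```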
Subscribe your phone’s ntfy app to the `psw-alerts` topic once, and every alert shows up as a push notification. Optional SMTP email is available too if `psw_smtp_host` is configured.
## Synthetic Probes: Blackbox Exporter
Prometheus can only scrape apps that expose metrics. Blackbox Exporter fills the gap by probing endpoints like a user would: HTTP, HTTPS, TCP, DNS, ICMP. Prometheus doesn’t auto-generate probe targets — you declare what you want checked via `prometheus_extra_scrape_targets`.
Typical use: probing your public domain from inside your network to catch certificate issues, or monitoring an external service your apps depend on.
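A probe declaration ultimately becomes a standard Blackbox Exporter scrape job. This is the canonical relabeling pattern from the exporter’s documentation — the job name, module, target, and exporter address below are illustrative; PSW derives the real values from `prometheus_extra_scrape_targets`:

```yaml
- job_name: blackbox-https
  metrics_path: /probe
  params:
    module: [http_2xx]        # Blackbox module: expect an HTTP 2xx response
  static_configs:
    - targets: ["https://example.com"]
  relabel_configs:
    # Move the probed URL into the ?target= query parameter...
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    # ...and point the actual scrape at the exporter itself
    - target_label: __address__
      replacement: blackbox-exporter:9115   # assumed exporter address
```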
## Up-or-Down at a Glance: Uptime Kuma
Grafana is powerful but dense. Uptime Kuma at `https://status.<your-domain>` is the user-friendly view: a big grid of green/red dots showing which apps are currently reachable. It’s different from Prometheus in an important way:
- Prometheus answers “how much CPU is Jellyfin using?”
- Uptime Kuma answers “is jellyfin.example.com responding right now?”
Uptime Kuma’s trick: it has a reconciler (`psw-apps/psw_apps/uptime_kuma/reconcilers.py`) that automatically creates a monitor for every deployed app that has a subdomain. Add a new app, and Uptime Kuma gets a green dot for it automatically. It also has an `uptime_kuma.ntfy` reconciler that wires up ntfy notifications when monitors go down.
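The reconciler’s core idea is a desired-state diff: compare the monitors that should exist (one per deployed app with a subdomain) against what Uptime Kuma currently has. A conceptual sketch in Python — the function name and data shapes are invented for illustration; the real logic lives in `reconcilers.py`:

```python
def reconcile_monitors(desired: dict[str, str], existing: set[str]):
    """Return (to_create, to_delete) so monitors match the deploy state.

    desired  -- app name -> URL, for every deployed app with a subdomain
    existing -- names of monitors Uptime Kuma already has
    """
    to_create = [(name, url) for name, url in sorted(desired.items())
                 if name not in existing]
    to_delete = sorted(existing - desired.keys())
    return to_create, to_delete


# Example: jellyfin was just deployed, oldapp was just removed
desired = {"jellyfin": "https://jellyfin.example.com",
           "grafana": "https://grafana.example.com"}
existing = {"grafana", "oldapp"}
print(reconcile_monitors(desired, existing))
```

Running the diff on every deploy is what makes the green dots appear and disappear without any manual monitor setup.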
## Homepage Is a Different Thing
Homepage is sometimes mentioned in the same breath but belongs to a different category (Infrastructure, not Observability). Homepage is the “links to all my apps in one place” dashboard — Grafana/Loki/Prometheus are the “what are they actually doing?” backend. Both can coexist; they answer different questions.
## Key Ideas
- Auto-discovery for metrics — `monitoring_enabled: true` in an app’s `meta.yml` is the whole opt-in; the monitoring convention does the rest
- Sidecars for apps without `/metrics` — `exporter_image` spins up a metrics translator next to the app, invisibly
- Logs via broadcast — Alloy deploys to every target, ships the systemd journal to Loki; no per-app config
- Dashboards come with the stack — Grafana ships curated read-only boards for the apps in the catalog; datasources auto-provisioned
- Alerts land on your phone — deploy ntfy and Alertmanager wires the webhook automatically
- Two levels of “is it up?” — Prometheus + Blackbox Exporter for deep metrics and external probes; Uptime Kuma for an at-a-glance status page