# Monitoring

## What Is It?
Monitoring is how PSW answers the question “is everything still working?” without you having to SSH around and poke at services. It collects two kinds of signals from every target and every app:
- Metrics — numbers over time: CPU, memory, disk, response times, queue lengths, request counts
- Logs — the text each service writes as it runs: errors, warnings, requests served, things that crashed
Those signals get shipped to a central place, displayed on dashboards, and — when something goes wrong — turned into a phone notification so you actually find out about it.
## The Stack
Everything on this page ships together under one stack:

```shell
psw app add @observability --target monitoring
```

The stack (`psw-apps/stacks/observability.yml`) is exactly nine apps. Each one does one thing:
| App | Role in one sentence |
|---|---|
| Prometheus | The metrics database. Scrapes numbers from every app’s `/metrics` endpoint on a schedule and stores the time series |
| Loki | The logs database. Receives log lines from every target and indexes them by label (host, service, level) |
| Alloy | The log shipper. Deployed to every target (broadcast app), reads the systemd journal and forwards entries to Loki |
| Node Exporter | The hardware probe. Also broadcast; exposes CPU/RAM/disk/network metrics for Prometheus to scrape |
| Grafana | The dashboards. Plots metrics from Prometheus and logs from Loki side by side |
| Alertmanager | The alert router. When Prometheus trips an alert rule, Alertmanager groups, deduplicates, and forwards it |
| ntfy | The phone notifier. Alertmanager fires webhooks at it, ntfy pushes to your phone |
| Blackbox Exporter | The synthetic prober. Hits HTTP/TCP/DNS endpoints from outside the app and reports up/down |
| Uptime Kuma | The status page. A friendlier “are my services reachable?” dashboard at `https://status.<domain>` |
You don’t usually need all nine — the stack bundles them because they all fit together cleanly, but you can `psw app add` individual ones if you want just metrics or just logs.
## Metrics: How Prometheus Finds Every App
When you add a new app, Prometheus picks it up automatically — no manual scrape config to edit. The plumbing is a convention:
- An app sets `monitoring_enabled: true` in its `meta.yml` (most catalog apps already do)
- At deploy time the monitoring convention renderer generates a tiny scrape config file at `services/<app>/monitoring/<app>-scrape.yml` — host, port, metrics path
- Prometheus declares itself as the aggregator for the `monitoring` convention. Every render pass copies every per-app `*-scrape.yml` into `services/prometheus/scrape-configs/` (a `.gitignore`d derived directory — the per-app files are the source of truth). The execution plan’s SYNC node rsyncs that aggregated directory to the Prometheus container
- Prometheus is started with `--web.enable-lifecycle`, and its config loads `scrape_config_files: scrape-configs/*.yml`. The aggregator triggers a live reload, no restart
- Next scrape interval, Prometheus is already hitting the new app
Remove the app and the reverse happens: the scrape file disappears, the aggregator syncs, Prometheus stops scraping.
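For intuition, a generated per-app scrape file follows standard Prometheus scrape-config syntax. A sketch of the plausible shape — the real file is emitted by the convention renderer, and the job name, host, and port below are assumptions:

```yaml
# services/jellyfin/monitoring/jellyfin-scrape.yml (illustrative shape;
# the convention renderer produces the real file)
scrape_configs:
  - job_name: jellyfin
    metrics_path: /metrics
    static_configs:
      - targets: ["10.0.0.12:8096"]   # app host and metrics port (assumed)
```

Because Prometheus loads `scrape-configs/*.yml` via `scrape_config_files`, dropping a file like this into the aggregated directory is all it takes for the job to appear.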
### When an app doesn’t have `/metrics` of its own
Some apps (PostgreSQL, qBittorrent, Sonarr and the rest of the `*arr` family, Tdarr, Seerr) don’t speak Prometheus natively. For them, `meta.yml` declares an `exporter_image`:

```yaml
# sonarr/meta.yml
monitoring_enabled: true
exporter_image: ghcr.io/onedr0p/exportarr:v2.3.0
```

The deploy engine notices this and spins up a sidecar container next to the main app, sharing its network namespace. The sidecar queries the app’s native API and re-emits the numbers as Prometheus metrics. The scrape config points Prometheus at the sidecar’s port instead of the app’s. You never see it — it just works.
## Logs: How Every Log Line Reaches Loki
Logs ride a completely different path — push instead of pull:
- Alloy is a broadcast app. It’s deployed to every managed target automatically, no opt-in
- On each target, Alloy reads `/var/log/journal` (the systemd journal — every container’s stdout/stderr and every host service writes here)
- It relabels entries with useful tags (service name, unit, host) and forwards them to Loki via `loki.write` at `http://<loki_ip>:3100/loki/api/v1/push`
- Loki indexes by label and stores the content. In Grafana’s Explore view you can filter by host, service, level, or search the log text directly
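The pipeline above boils down to two Alloy components wired together. A minimal sketch in Alloy’s configuration language — the component names are standard Alloy, but PSW’s actual relabel rules are omitted and the Loki address is a placeholder:

```alloy
// Read the systemd journal and hand every entry to the Loki writer.
loki.source.journal "journal" {
  path       = "/var/log/journal"
  forward_to = [loki.write.default.receiver]
}

// Push entries to Loki's HTTP ingestion endpoint.
loki.write "default" {
  endpoint {
    url = "http://<loki_ip>:3100/loki/api/v1/push"
  }
}
```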
Node Exporter gets the same broadcast treatment — deployed everywhere, Prometheus scrapes every instance — so you see hardware metrics per target without configuring anything per target.
## Dashboards
Grafana ships with a curated set of dashboards pre-provisioned from its own `psw-apps/grafana/` directory. Currently bundled (exact list in `meta.yml` → `extra_files`):
- Alertmanager, Blackbox Exporter, Forgejo, Loki metrics, Node Exporter, PostgreSQL, qBittorrent, Radarr, Sonarr, Traefik
They’re loaded as read-only: you can clone or save-as to edit, but the originals stay pristine across upgrades. Two datasources are auto-provisioned alongside them — Prometheus and Loki — so dashboards and Explore work the moment Grafana finishes starting.
Unlike scrape configs, dashboards don’t aggregate from individual apps’ `meta.yml` — they’re hand-curated in the Grafana app itself.
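Datasource auto-provisioning uses Grafana’s standard provisioning format. A sketch of what PSW plausibly generates — the names are real Grafana provisioning fields, but the URLs and defaults here are assumptions:

```yaml
# grafana/provisioning/datasources/datasources.yml (illustrative)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090   # assumed container address
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100         # assumed container address
```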
## Alerts: From “something broke” to Your Phone
Prometheus evaluates alert rules shipped in `psw-apps/prometheus/templates/alert-rules.yml.j2`. The defaults cover:
- Always — `InstanceDown`, `HighMemoryUsage`, `DiskSpaceLow`, `HighCpuUsage` (these all come from Node Exporter metrics)
- Conditional on what’s deployed — PostgreSQL rules if postgres is installed, Traefik rules if traefik is installed, Loki error rules if loki is installed
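The real rules live in the Jinja2 template above; for flavor, here is a hand-written `DiskSpaceLow` in standard Prometheus rule syntax. The threshold and duration are illustrative, not PSW’s actual values:

```yaml
groups:
  - name: node
    rules:
      - alert: DiskSpaceLow
        # Fires when a filesystem has been under 10% free for 15 minutes
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
```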
When a rule fires, Prometheus sends the alert to Alertmanager. Alertmanager’s config (`psw-apps/alertmanager/templates/alertmanager.yml.j2`) auto-configures an ntfy webhook receiver the moment ntfy is deployed — alerts land at the `psw-alerts` topic. Default alerts repeat every 4 hours; critical ones every hour.
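Those repeat intervals map onto Alertmanager’s routing tree. A minimal sketch of the generated config’s likely shape — the receiver name and ntfy URL are assumptions; the real source is `alertmanager.yml.j2`:

```yaml
route:
  receiver: ntfy
  repeat_interval: 4h        # default alerts re-notify every 4 hours
  routes:
    - matchers:
        - severity = "critical"
      repeat_interval: 1h    # critical alerts re-notify every hour
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://ntfy/psw-alerts   # assumed ntfy topic endpoint
```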
Subscribe your phone’s ntfy app to the `psw-alerts` topic once, and every alert shows up as a push notification. Optional SMTP email is available too if `psw_smtp_host` is configured.
## Synthetic Probes: Blackbox Exporter
Prometheus can only scrape apps that expose metrics. Blackbox Exporter fills the gap by probing endpoints like a user would: HTTP, HTTPS, TCP, DNS, ICMP. Prometheus doesn’t auto-generate probe targets — you declare what you want checked via `prometheus_extra_scrape_targets`.
Typical use: probing your public domain from inside your network to catch certificate issues, or monitoring an external service your apps depend on.
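A probe declaration ultimately becomes a standard Blackbox Exporter scrape job. This is the canonical relabeling pattern from the exporter’s documentation — the job name, module, target, and exporter address below are illustrative; PSW derives the real values from `prometheus_extra_scrape_targets`:

```yaml
- job_name: blackbox-https
  metrics_path: /probe
  params:
    module: [http_2xx]        # Blackbox module: expect an HTTP 2xx response
  static_configs:
    - targets: ["https://example.com"]
  relabel_configs:
    # Move the probed URL into the ?target= query parameter...
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    # ...and point the actual scrape at the exporter itself
    - target_label: __address__
      replacement: blackbox-exporter:9115   # assumed exporter address
```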
## Up-or-Down at a Glance: Uptime Kuma
Grafana is powerful but dense. Uptime Kuma at `https://status.<your-domain>` is the user-friendly view: a big grid of green/red dots showing which apps are currently reachable. It’s different from Prometheus in an important way:
- Prometheus answers “how much CPU is Jellyfin using?”
- Uptime Kuma answers “is jellyfin.example.com responding right now?”
Uptime Kuma’s trick: it has a reconciler (`psw-apps/psw_apps/uptime_kuma/reconcilers.py`) that automatically creates a monitor for every deployed app that has a subdomain. Add a new app, and Uptime Kuma gets a green dot for it automatically. It also has an `uptime_kuma.ntfy` reconciler that wires up ntfy notifications when monitors go down.
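The reconciler’s core idea is a desired-state diff: compare the monitors that should exist (one per deployed app with a subdomain) against what Uptime Kuma currently has. A conceptual sketch in Python — the function name and data shapes are invented for illustration; the real logic lives in `reconcilers.py`:

```python
def reconcile_monitors(desired: dict[str, str], existing: set[str]):
    """Return (to_create, to_delete) so monitors match the deploy state.

    desired  -- app name -> URL, for every deployed app with a subdomain
    existing -- names of monitors Uptime Kuma already has
    """
    to_create = [(name, url) for name, url in sorted(desired.items())
                 if name not in existing]
    to_delete = sorted(existing - desired.keys())
    return to_create, to_delete


# Example: jellyfin was just deployed, oldapp was just removed
desired = {"jellyfin": "https://jellyfin.example.com",
           "grafana": "https://grafana.example.com"}
existing = {"grafana", "oldapp"}
print(reconcile_monitors(desired, existing))
```

Running the diff on every deploy is what makes the green dots appear and disappear without any manual monitor setup.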
## Homepage Is a Different Thing
Homepage is sometimes mentioned in the same breath but belongs to a different category (Infrastructure, not Observability). Homepage is the “links to all my apps in one place” dashboard — Grafana/Loki/Prometheus are the “what are they actually doing?” backend. Both can coexist; they answer different questions.
## Key Ideas
- Auto-discovery for metrics — `monitoring_enabled: true` in an app’s `meta.yml` is the whole opt-in; the monitoring convention does the rest
- Sidecars for apps without `/metrics` — `exporter_image` spins up a metrics translator next to the app, invisibly
- Logs via broadcast — Alloy deploys to every target, ships the systemd journal to Loki; no per-app config
- Dashboards come with the stack — Grafana ships curated read-only boards for the apps in the catalog; datasources auto-provisioned
- Alerts land on your phone — deploy ntfy and Alertmanager wires the webhook automatically
- Two levels of “is it up?” — Prometheus + Blackbox Exporter for deep metrics and external probes; Uptime Kuma for an at-a-glance status page