Local AI#

What Is It?#

Local AI means you run a chatbot — and eventually voice, image generation, photo search, and document tagging — on your own hardware, with the data staying inside your house. No subscriptions. No cloud. The kid asks the speaker to turn off the kitchen light, the family chats about dinner with a private assistant, the photo app finds “beach 2018” — and none of it ever leaves the box.

Think of it like having your own private chatbot living in the basement. You walk past, ask a question, it answers. The conversation is on a hard drive you own. Nobody else sees it.

The Two Apps That Make It Work#

In Phase 1 (the first wave) PSW ships two AI apps, working as a pair:

| App | Role | Where it lives |
| --- | --- | --- |
| Ollama | The brain. Holds the AI models on disk and answers questions. No web page of its own — it’s a backend | psw-apps/psw_apps/ollama/ |
| Open WebUI | The face. The chat page you actually visit at chat.&lt;your-domain&gt;, plus the OpenAI-compatible API other apps can call into | psw-apps/psw_apps/openwebui/ |

Why two and not one: Ollama is the engine that runs the heavy math; Open WebUI is the friendly chat page on top. Splitting them lets future AI engines (vLLM for fast batch jobs, ComfyUI for image generation) plug in behind the same chat UI without rewriting anything else.

What You Need#

A real GPU. Not optional. PSW’s AI capability supports two vendor paths:

| Vendor | Minimum card | Notes |
| --- | --- | --- |
| NVIDIA | Discrete card with ≥8 GB VRAM — a used RTX 3060 12 GB is a sensible floor | The most-tested path. Driver pinned to branch 580 (see phase-0-gpu-plumbing.md). Requires secure boot disabled. |
| AMD (ROCm) | Discrete card with ≥8 GB VRAM — Radeon RX 6800 / 6800 XT / 6900 XT / 7900 XT / 7900 XTX, MI-series datacentre cards | No driver install (the amdgpu module is built into the mainline Linux kernel that Proxmox 9 ships). The ROCm runtime ships inside each AI container image. Secure boot is fine. |

CPU-only AI is a deliberate non-goal. It’s slow enough on small models that the experience feels broken, and PSW’s job is to make a production-grade self-hosted home server — see philosophy.md — not to lie to you about what your hardware can do. If you have no AI-class GPU, the AI apps refuse to install with a clear hardware-floor message.

Intel Arc / oneAPI is Phase 4 — not yet supported. Intel iGPUs (N100, N305, etc.) work fine for media transcoding (Jellyfin, Tdarr) but are not currently a path to local AI.

A bigger box than the rest of PSW expects. AI is GPU-bound, and the GPU lives on a node. That node also needs the GPU plumbing configured, an unprivileged LXC for the AI apps, and a fast local NVMe drive for the model weights (more on this below).

For NVIDIA only: secure-boot-disabled UEFI/BIOS on that node. PSW installs NVIDIA’s driver from NVIDIA’s apt repo; secure boot would refuse to load those kernel modules. AMD doesn’t need this — amdgpu is signed and accepted by secure-boot trust chains.

What Runs on the Host#

Before any AI app can deploy, the Proxmox node hosting the AI workload needs the GPU plumbing in place. PSW handles this with a single command:

psw node setup-gpu <node>

The command auto-detects the GPU vendor and runs the right path:

NVIDIA path#

  • Installs the pinned NVIDIA driver (currently branch 580 — see phase-0-gpu-plumbing.md for why).
  • Installs nvidia-container-toolkit — the bridge between the host’s GPU and a container that wants to use it.
  • Generates a CDI (Container Device Interface) spec at /etc/cdi/nvidia.yaml. CDI is the modern standard for handing devices to containers — declarative, no per-container runtime hooks, no environment variables, just --device=nvidia.com/gpu=all and the runtime knows what to mount.
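For orientation, a CDI spec is plain YAML. The real /etc/cdi/nvidia.yaml is much longer (library mounts, hooks, per-GPU entries); this trimmed excerpt only illustrates the general shape:

```yaml
# Illustrative excerpt, not the generated file.
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  - name: all
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
        - path: /dev/nvidiactl
```

A CDI-aware runtime resolves --device=nvidia.com/gpu=all against the `kind` and `devices` entries in this file and performs the listed container edits automatically.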

AMD path#

Much shorter, because AMD’s host story is simpler:

  • Verifies the amdgpu kernel module is loaded (it ships built-in with mainline Linux).
  • Verifies /dev/kfd (the HSA kernel fusion driver interface) is present.
  • Captures the host’s render and video group GIDs so the LXC passthrough can map them correctly.
  • No driver install (amdgpu is mainline). No container-toolkit install (ROCm runtime ships inside each AI container image — ollama/ollama:0.22.0-rocm etc.). No CDI spec (AMD’s container ecosystem hasn’t adopted CDI yet; passthrough is plain --device= flags).
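As a sketch of what plain-flag passthrough looks like for the AMD path (device paths and group handling vary per host; this is illustrative, not PSW's exact invocation):

```shell
# Illustrative — the actual device paths and GIDs come from the recorded gpu.yml.
podman run --rm \
  --device /dev/kfd \
  --device /dev/dri/renderD128 \
  --group-add keep-groups \
  ollama/ollama:0.22.0-rocm
```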

Common to both#

PSW records what it found in nodes/&lt;n&gt;/gpu.yml (with a vendor: nvidia or vendor: amd discriminator). Convergence reads that file when deploying an AI app’s LXC:

  • For NVIDIA, it installs the exact-matching userspace libraries inside the container — driver version skew between host kernel module and container userspace breaks nvidia-smi with a confusing error message.
  • For AMD, no userspace install is needed (ROCm is in the container); convergence just maps the host’s render/video GIDs into the LXC for pct set -devN gid=....

Either way, PSW refuses to ship a deploy where the vendor flag in the deployment plan and the recorded gpu.yml vendor disagree.
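The exact schema of gpu.yml isn't reproduced in this doc; an illustrative record for an AMD node might look like this (field names other than the vendor discriminator are assumptions):

```yaml
# Hypothetical example — only `vendor` is confirmed by this doc.
vendor: amd
render_gid: 104        # host render group, mapped into the LXC
video_gid: 44          # host video group, mapped into the LXC
devices:
  - /dev/kfd
  - /dev/dri/renderD128
```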

If you skip setup-gpu and try to deploy an AI app, convergence refuses with a remediation hint pointing back here.

Where the Models Live#

AI models are big. A typical household ends up with 40–200 GB of model weights — Llama 3.1 8B, Qwen 2.5 7B, an embedding model for RAG, maybe a coder variant, maybe a 70B beast for the GPU-rich.

PSW gives them their own storage class — models — with ZFS tuning specifically for the access pattern (large sequential reads, memory-mapped loads, never written-to once downloaded):

  • Fastest local NVMe pool only. Never NFS. Loading a model over NFS is too slow to be a real experience.
  • recordsize=1M, compression=off (model weights are already pre-compressed binary tensors), atime=off.
  • Weekly snapshots max — model files don’t change once pulled, so frequent snapshots waste space.
  • Backup-excluded by default. The models are re-downloadable from the upstream registry; the chat history and per-user RAG corpus are what we actually back up. See backups.md.
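Assuming a pool named tank (a hypothetical name), the tuning above maps onto standard ZFS dataset properties:

```shell
# Hypothetical pool/dataset names; the property values are the ones listed above.
zfs create -o recordsize=1M -o compression=off -o atime=off tank/models
zfs get recordsize,compression,atime tank/models   # verify the settings took
```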

The AI planner sizes the AI target’s models dataset based on how many AI apps you’ve selected. Default 100 GB.

Pulling Your First Model#

Ollama ships empty by default — no pre-bundled models. This is deliberate (see docs/plans/ai-apps/vision.md § “Phasing”): the household admin picks the models that match their hardware and use case, rather than PSW guessing.

Once Open WebUI is up and you’ve logged in as the LLDAP admin (the akadmin account from SSO bootstrap):

  1. Click your profile picture > Admin Panel.
  2. Settings > Models.
  3. Type a model name in Pull a model from Ollama.com — see Ollama’s model library for the full list.
  4. Click pull, wait for the download.

The model lands on the models ZFS dataset. From now on every household member with an LLDAP account can chat with it from the model selector at the top of the chat page.
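If you prefer the command line, the same pull can be issued against Ollama's own REST API (the /api/pull endpoint and port 11434 are Ollama upstream defaults; whether that port is reachable off the AI target depends on your setup — the admin panel is the supported path):

```shell
# Illustrative — <ollama-host> is a placeholder for the AI target.
curl http://<ollama-host>:11434/api/pull -d '{"name": "nomic-embed-text"}'
```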

Recommended starter set (subject to your hardware budget):

| Model | VRAM (~) | Why |
| --- | --- | --- |
| llama3.1:8b-instruct-q4_K_M | 5 GB | Solid general-purpose chat. Strong on reasoning and code. |
| qwen2.5:7b-instruct | 5 GB | Excellent at structured output (JSON, function calls); good for the planner local-backend. |
| nomic-embed-text | 1 GB | Tiny embedding model for RAG; almost always worth pulling alongside a chat model. |

A 12 GB GPU runs any one of those comfortably with room for a system prompt and a long context. For a 24 GB GPU you can step up to 13B-class chat models or run two models side-by-side (one chat, one embedder).
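The VRAM figures above follow from a simple rule of thumb: quantised weight size ≈ parameter count × bits per weight / 8, plus some headroom for the KV cache and runtime buffers. A rough calculator (the overhead constant is a loose assumption, not a PSW value):

```python
def approx_vram_gb(params_billions: float, bits_per_weight: float,
                   overhead_gb: float = 0.5) -> float:
    """Rough VRAM estimate: quantised weights plus a flat allowance
    for KV cache and runtime buffers (a loose assumption)."""
    weights_gb = params_billions * bits_per_weight / 8  # 1B params @ 8 bits ≈ 1 GB
    return round(weights_gb + overhead_gb, 1)

# An 8B model at ~4.5 bits/weight (q4_K_M-class) lands near the 5 GB figure above.
print(approx_vram_gb(8, 4.5))   # → 5.0
```

Longer contexts grow the KV cache well past this flat allowance, which is why a 12 GB card is comfortable for an 8B q4 model rather than merely sufficient.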

The Front Door#

Browser users hit https://chat.<your-domain>/ and get bounced to Authelia for login. Once authenticated they land on the chat page, with their LLDAP role determining whether they see the admin panel.

Group-based role mapping (see sso.md § Role mapping):

| LLDAP group | Open WebUI role | What you can do |
| --- | --- | --- |
| lldap_admin | admin | Pull / delete models, manage users, mint API keys, see all chats |
| (any other LLDAP group) | user | Chat with the models the admin has pulled, manage your own chats |

The role is re-evaluated every login from the live LLDAP group claim. There’s no first-user-wins admin promotion: removing someone from lldap_admin in LLDAP demotes them on their next login. This is documented in sso.md as the general PSW pattern; AI is the first concrete consumer.

The Side Door — chat.<domain>/api#

Browser users use the chat page; other apps use Open WebUI’s REST surface at chat.<your-domain>/api (Open WebUI’s OpenAI-shape /api/chat/completions is what every bearer-token consumer POSTs to). This is what lets Home Assistant voice (Phase 2), Paperless-ngx document tagging (Phase 3), and any future cloud-AI-replacement consumer talk to your local Ollama as if it were OpenAI’s API. Same SDK, same endpoints, different host.
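A consumer POSTing to that surface sends the standard OpenAI chat-completions shape. A minimal sketch of building such a request (the domain and key are placeholders; the endpoint path is the one named above):

```python
import json

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Return (url, headers, body) for an OpenAI-shape chat completion
    against Open WebUI's /api surface."""
    url = f"{base_url}/api/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",   # the per-consumer service key
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, headers, body

url, headers, body = build_chat_request(
    "https://chat.example.com", "sk-example",
    "llama3.1:8b-instruct-q4_K_M", "Summarise this document")
# POST with any HTTP client; the response mirrors OpenAI's schema.
```

Because the shape matches OpenAI's, an off-the-shelf OpenAI SDK pointed at chat.&lt;your-domain&gt;/api with the bearer key works the same way — same SDK, different host, as the text says.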

Each consumer gets its own service-user API key — auto-minted by PSW into Open WebUI’s user table, stored encrypted in secrets/apps.yml, presented as a bearer token on every request. Per-consumer rather than shared because:

  • A leaked HA voice key is reissued without invalidating Paperless’s key.
  • The admin panel surfaces request volume per key — abuse shows up in one bucket.
  • psw deploy reset rotates each key independently.

The keys are minted by the openwebui setup reconciler (psw_apps.openwebui.setup.OpenWebUIApiKeysReconciler) on the first convergence tick after the consumer is added. It execs into the running openwebui container and calls Open WebUI’s own SQLAlchemy ORM to create a per-consumer service user (role admin, email psw-<consumer>@psw.local) and write a fresh sk-<uuid4hex> api_key row. The key lands in apps.<consumer>.<consumer>_openwebui_service_key after the setup-callback dispatcher merges it into secrets/apps.yml — no manual mint step.

Consumer reconcilers call psw_apps.openwebui.api_keys.ensure_service_user_key() to obtain and validate the key on every tick. If a key has been revoked manually in the admin panel, the helper raises DriftSkip and the next openwebui setup tick auto-rotates it.
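The names ensure_service_user_key and DriftSkip come from the text above; the surrounding structure here is a hypothetical sketch of the pattern, not PSW's actual code:

```python
class DriftSkip(Exception):
    """Raised when observed state has drifted; the current tick is skipped
    and a later reconciler run is expected to repair the drift."""

def ensure_service_user_key(store: dict, consumer: str) -> str:
    # Sketch: `store` stands in for Open WebUI's user/api_key table.
    key = store.get(consumer)
    if key is None:
        raise DriftSkip(f"{consumer}: key revoked or missing; awaiting rotation")
    return key

# Consumer reconciler tick (sketch):
store = {"homeassistant": "sk-abc123"}
assert ensure_service_user_key(store, "homeassistant") == "sk-abc123"
try:
    ensure_service_user_key(store, "paperless")   # revoked key
except DriftSkip:
    pass  # the next openwebui setup tick would mint a fresh key
```

The important property is that a revoked key never hard-fails the consumer: it skips the tick and lets the setup reconciler converge back to a valid key.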

What Stays on the Box#

PSW makes one promise about local AI louder than any other: the data stays here. Concretely (see docs/plans/ai-apps/vision.md § Privacy for the full invariant table):

| Data | Where it lives |
| --- | --- |
| Every prompt and response in every chat | Open WebUI’s table in shared core Postgres |
| Per-user RAG corpus (uploaded docs, conversation memory) | Open WebUI’s data volume today — in-process ChromaDB. The shared Postgres image now bundles PGVector and VectorChord, so a follow-up moves Open WebUI’s vectors into the shared database |
| Model weights | The models ZFS dataset on local NVMe |
| Service-user API keys | secrets/apps.yml, SOPS-encrypted with your age key |

What never happens silently:

  • No default-on cloud routing. Open WebUI lets users add cloud connections (OpenAI, Anthropic, OpenRouter) per-account, but they’re disabled at install time and there is no fallback path that quietly hops to a cloud model when local inference is slow or fails. This is enforced by ENABLE_OPENAI_API=false in Open WebUI’s env.
  • No telemetry phoning home. OLLAMA_NOHISTORY=true, ANONYMIZED_TELEMETRY=false on Open WebUI. Same posture for every AI app added later.
  • No “improve the model” data sharing. None of the engines we ship do this; if a future engine does, it’s disabled in meta.yml or it doesn’t ship.
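In container-environment terms, the bullets above reduce to a handful of variables (illustrative fragment; the variable names are the ones cited above):

```shell
# Open WebUI / Ollama container environment (fragment)
ENABLE_OPENAI_API=false      # no cloud connections enabled at install time
OLLAMA_NOHISTORY=true        # per the telemetry posture above
ANONYMIZED_TELEMETRY=false   # no telemetry phoning home
```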

The single escape hatch: an individual user can opt their own account into a cloud model by pasting their own API key in Settings > Connections. Their chats with that model go to that vendor, by their explicit, deliberate choice.

Adding the Stack#

The two AI apps land together via the @ai stack:

psw app add @ai

That adds Ollama and Open WebUI to your project. Then the usual pipeline:

psw deploy converge

The AI planner places both apps on the AI-class target (the LXC bound to your AI-class GPU). Convergence provisions the LXC, applies the GPU plumbing per gpu.yml, ships the quadlets, and runs the readiness probes until both apps answer.

First-run timing: about 3–5 minutes of model-free deploy on a clean cluster. Pulling your first model is another 5–15 minutes depending on its size and your internet connection.

Voice — “Hey Casa, turn off the kitchen”#

Phase 2 ships fully local voice control. Two new apps, one Home Assistant integration, zero cloud calls:

| App | Role |
| --- | --- |
| Whisper (Wyoming Faster-Whisper) | Speech-to-text. Listens on a Wyoming protocol TCP port (10300), takes audio from Home Assistant’s voice pipeline, returns transcribed text. CPU-only by default; GPU-accelerated when an AI-class card is available. |
| Piper (Wyoming Piper) | Text-to-speech. Listens on Wyoming TCP port 10200, takes text from Home Assistant, returns synthesised audio. CPU-only — Piper synthesises faster than realtime even on a Raspberry-Pi-class CPU; the GPU stays free for LLM workloads. |

The integration glue lives in homeassistant/integrations.py:

  • homeassistant.wyoming_whisper registers Whisper as Home Assistant’s STT engine — the household admin sees it in Settings > Voice assistants > Add assistant > Speech-to-text.
  • homeassistant.wyoming_piper registers Piper as the TTS engine — same Settings page.
  • homeassistant.openai_conversation points Home Assistant’s conversation agent at chat.<your-domain>/api (Open WebUI’s /api/chat/completions endpoint) with a per-consumer service-user API key. The household member’s voice command goes: microphone → Whisper STT → conversation agent → Open WebUI’s /api → Ollama → Open WebUI → Home Assistant action → Piper TTS → speaker. Every byte stays inside the box.

Add the voice stack:

psw app add @voice

The @voice stack pulls in Whisper, Piper, Home Assistant, Mosquitto, Ollama, and Open WebUI together — the full chain needed for working voice control. If you already have @ai deployed, @voice adds only what’s missing.

Set up the conversation key. PSW auto-mints ha_openwebui_service_key on the openwebui target via the openwebui setup reconciler — no operator action required. The next convergence tick after psw app add @voice wires Home Assistant’s conversation agent at chat.<domain>/api and the voice pipeline lights up. Default model is llama3.1:8b-instruct-q4_K_M — override via homeassistant_conversation_model in services/homeassistant/defaults.yml.
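Overriding the default conversation model is then a one-line change (illustrative fragment; the key name and file path are the ones given above, the model value is just an example):

```yaml
# services/homeassistant/defaults.yml (fragment)
homeassistant_conversation_model: qwen2.5:7b-instruct
```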

Privacy invariants for voice (from vision.md § Privacy):

  • Voice clips are never persisted. Whisper transcribes audio in-memory and discards it; only the resulting text reaches Home Assistant’s logbook.
  • Home Assistant’s voice pipeline is configured to NEVER use HA Cloud’s voice processing. The integration explicitly points STT/TTS at the local Wyoming endpoints.
  • The conversation agent is the local Open WebUI relay, not OpenAI. The reconciler validates the local URL on every apply; setting it to a cloud endpoint would require a deliberate config-flow override outside PSW.

What This Doesn’t Cover (Yet)#

Phase 1 + Phase 2 ship chat + voice. Other AI capabilities are tracked in docs/plans/ai-apps/vision.md and land in subsequent phases:

  • AI-enabled application stacks (Phase 3, all shipped) — @photos (Immich + machine-learning sidecar for CLIP semantic search), @office (Paperless-ngx + Tika + Gotenberg + paperless-ai LLM tagging + paperless-gpt vision-LLM OCR), and @cctv (Frigate with neural-network object detection on Coral USB or NVIDIA / AMD GPU). Each stack replaces a household-scale cloud vendor.
  • vLLM (Phase 3, shipped) — a second LLM engine optimised for concurrent requests; lands as a second connection in Open WebUI’s backend list — paperless-ai’s tagging-burst workload is exactly the scenario where vLLM’s continuous batching pays off vs Ollama’s queue model.
  • ComfyUI (Phase 3, shipped) — graph-based diffusion / image-gen UI at comfyui.<your-domain>. Heaviest VRAM consumer in the catalogue; on a 24 GB AI-class GPU you can run FLUX.1-schnell (4-step image generation in 4–6 seconds), SDXL, video models like Wan 2.2 or HunyuanVideo. Authelia gates the UI per the standard PSW posture (ComfyUI itself has no native auth — never expose it directly).
  • Intel GPU acceleration (Phase 4) — Intel oneAPI for Arc / Battlemage discrete cards. NVIDIA and AMD ROCm both ship today; Intel discrete is the remaining vendor.

The vision doc is the source of truth for what’s coming and why. It locks the structural decisions; phase-N implementation plans turn those decisions into PRs.