OS self-update (kernel + base image) — Phase 0 audit

Goal: update SwiftOS itself (kernel + base image) on a running box, with no Rescue/live-dd — deliver a signed bundle over HTTPS, stage into the inactive A/B slot, atomically switch, health-confirm, auto-rollback on failure.

This document is the mandatory Phase-0 audit: which parts of the A/B skeleton are real and which are stubs or missing, established before any new code is written.

TL;DR

The A/B mechanism is far more complete than the prompt assumes — both the kernel (ESP) and the base image (data-disk update store) already have signed slots, attempt-based rollback, confirm, and full QEMU test coverage. What is missing is the delivery + coordination + anti-rollback layer the prompt actually asks for:

- a signed system-update bundle (kernel+base, monotonic version),
- a download/stage path that works on a live box (today base staging needs a physically-attached payload disk; kernel staging can only duplicate the running kernel, never install a new one),
- one coordinated boot-selector (kernel and base are selected independently today and can drift),
- monotonic anti-rollback (never enforced),
- /bin/swupdate os <url> + unified confirm, and make os-update-test.

What is REAL (implemented, tested, in `make test`)

Kernel A/B — ESP side (`boot/efi/loader.c`, `kernel/fs/esp.swift`)

UEFI loader reads a signed SWOSKERN manifest (kernelboot.swift v3 = per-slot SHA-256 + Ed25519) and kernelA.bin/kernelB.bin from \EFI\swift-os on the GPT/ESP disk; verifies the slot hash before jumping.
Writable kernel-state record (SWOSKSTA, SHA-256-protected, not signed): per-slot attemptCount + state (untried/confirmed), seq, lastBooted, mutable active. Loader bumps the attempt counter each boot.
Loader attempt-based rollback (KS_MAX_ATTEMPTS=3): unconfirmed active slot that exhausts its attempts → boot the fallback. (uefi_krollback_test.sh.)
Runtime syscalls (capConsole-gated):
- swos-kstage (68) — FAT32 in-place copy active→inactive kernel slot + verify.
- swos-kactivate (69) — flip active slot in kernel-state, mark untried.
- swos-kconfirm (70) — mark booted kernel slot confirmed.
Tests: uefi_kernel_ab, uefi_kstage, uefi_kactivate, uefi_kattempt, uefi_kconfirm, uefi_krollback — all wired into make test.
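
The attempt-based rollback described above can be sketched as a small selection function. This is a minimal illustration, not the code in `boot/efi/loader.c`: the struct layout and the names `ks_select`, `struct kstate` and `struct ks_slot` are hypothetical, and it assumes the counter is only bumped for an unconfirmed slot and that the fallback is simply the other slot.

```c
#include <assert.h>
#include <stdint.h>

#define KS_MAX_ATTEMPTS 3 /* value stated in this audit */

/* Hypothetical in-memory view; the real SWOSKSTA layout is not shown here. */
enum ks_slot_state { KS_UNTRIED = 0, KS_CONFIRMED = 1 };

struct ks_slot {
    uint8_t state;         /* enum ks_slot_state */
    uint8_t attempt_count; /* boots attempted since activation */
};

struct kstate {
    uint8_t active;        /* 0 = slot A, 1 = slot B */
    struct ks_slot slot[2];
};

/* Return the slot index to boot and record the attempt in *ks. */
int ks_select(struct kstate *ks)
{
    assert(ks->active < 2);
    int act = ks->active;
    struct ks_slot *s = &ks->slot[act];

    if (s->state == KS_CONFIRMED)
        return act;                 /* confirmed slots boot unconditionally */
    if (s->attempt_count >= KS_MAX_ATTEMPTS)
        return act ^ 1;             /* attempts exhausted: boot the fallback */
    s->attempt_count++;             /* caller persists the record before jumping */
    return act;
}
```

With these assumptions an untried slot gets three boots; the fourth boot falls back unless swos-kconfirm has marked the slot confirmed in the meantime.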

Base-image A/B — data-disk update store (`kernel/fs/swosboot.swift`, `updatestore.swift`)

SWOSBOOT manifest: CRC32-protected, double-buffered (LBA 0/1, torn-write safe, highest valid sequence wins). Two slots, each a full signed SWOSBASE-v3 image, plus active/fallback, per-slot state/attemptCount/generation. Trust boundary documented: manifest is not a trust anchor; it only selects among self-authenticating signed images.
Boot: updateStoreInit selects active slot, records the boot attempt, and does attempt-based rollback (maxBootAttempts=3, U1d) to the fallback; verified-fallback at mount if the active image fails Ed25519.
Power-fail discipline already present: stage slot fully + virtioBlkFlush before the manifest write-back, and the manifest write goes to the other double-buffer copy then flushes (U1h).
Runtime syscalls (capConsole-gated):
- swos-update (67) — copy an attached payload disk into the inactive slot.
- swos-activate (66) — flip active base slot (on trial, untried).
- swos-confirm (65) — confirm booted base slot.
Host builder tools/updatestore.swift; tests ab_update/persist/confirm/rollback/activate/payload/stage/flush, all in make test.
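
The "highest valid sequence wins" rule for the double-buffered manifest can be illustrated as follows. This is a sketch under assumptions: the struct is a cut-down, hypothetical stand-in for the real SWOSBOOT layout, the checksum is assumed to be the standard reflected CRC-32 (polynomial 0xEDB88320), and sequence wrap-around is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical manifest copy; the real SWOSBOOT record carries more fields. */
struct manifest {
    uint32_t sequence;
    uint8_t  active;   /* 0 or 1 */
    uint8_t  fallback; /* 0 or 1 */
    uint8_t  pad[2];
    uint32_t crc;      /* CRC-32 over every byte before this field */
};

/* Standard reflected CRC-32, computed bit by bit (no lookup table). */
uint32_t crc32_calc(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Stamp the checksum after the other fields have been filled in. */
void manifest_seal(struct manifest *m)
{
    m->crc = crc32_calc(m, offsetof(struct manifest, crc));
}

static int manifest_valid(const struct manifest *m)
{
    return crc32_calc(m, offsetof(struct manifest, crc)) == m->crc;
}

/* Return the index (0 or 1) of the copy to trust, or -1 if both are corrupt. */
int manifest_pick(const struct manifest copy[2])
{
    int ok0 = manifest_valid(&copy[0]);
    int ok1 = manifest_valid(&copy[1]);
    assert((ok0 == 0 || ok0 == 1) && (ok1 == 0 || ok1 == 1));
    if (ok0 && ok1)
        return copy[1].sequence > copy[0].sequence ? 1 : 0;
    if (ok0) return 0;
    if (ok1) return 1;
    return -1;
}
```

A writer that always targets the copy not returned by `manifest_pick` leaves the last good copy untouched, so a torn write only invalidates the newer record.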

Site updates (`/bin/swupdate`, SU-A/B/C) — reference pattern, already shipped

SWSITE signed bundle (Ed25519 + SHA-256), fetched over TLS 1.3 by swupdate site <url>, staged into /data/www/next, atomically swapped (current→prev, next→current) with crash recovery on next boot. This is exactly the delivery/stage/atomic-swap shape we need for the OS bundle — reuse it.

GAPS vs the prompt (this is the work to do)

No coordinated boot-selector. Kernel slot (ESP kernel-state) and base slot (data-disk SWOSBOOT) are chosen independently; swos-kactivate and swos-activate are separate commands. Nothing guarantees kernel-A boots with base-A. Prompt requires one atomic flip selecting kernel+base together.
No live delivery of a new OS.
- Base staging reads from a physically-attached read-only payload disk (virtioBlkSelectPayload) — not possible on a live Hetzner box.
- Kernel staging (swos-kstage) only duplicates the currently-running kernel into the inactive slot — there is no path to install a new kernel binary.
- kernelA.bin/kernelB.bin are both built from the same $(KERNEL_BIN), and the FAT32 writer only supports a same-size in-place copy (ka.1 != kb.1 → EINVAL). A genuinely new kernel of a different size therefore cannot be written in place; this needs either fixed/padded slot sizing or a FAT cluster (re)allocator. Decision needed.
- /bin/swupdate has seed/apply-local/site only — no os subcommand.
- No combined kernel+base system-bundle format (full or delta).
No monotonic anti-rollback. generation/sequence exist but are not a security version; an older, validly-signed (and possibly vulnerable) image is accepted. Prompt requires a monotonic version refused if ≤ the installed one.
No kernel-mediated "write inactive slot from /data bytes" path. The only privileged slot writes are payload-disk→slot (base) and active→inactive-copy (kernel). Neither takes bytes from a downloaded /data staging file.
No unified confirm / auto-confirm. Confirm is split (swos-confirm + swos-kconfirm), both manual. No /bin/swupdate confirm and no auto-confirm on "services healthy (sshd+nginx up)".
No make os-update-test and, critically, no real-hardware validation. Per this session's lesson (the timer netPump passed QEMU but killed the NIC on real Ampere/KVM), every boot/driver-path change here must be checked on the live box via Console before we rely on it.
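
The missing anti-rollback rule is small enough to sketch directly. The names below (`os_bundle_check`, `os_min_version_on_confirm`, the verdict enum) are hypothetical, the signature result is assumed to come from the existing Ed25519 verification, and raising the stored minimum only at confirm time is a design assumption rather than something this audit has decided.

```c
#include <assert.h>
#include <stdint.h>

enum os_verdict { OS_ACCEPT = 0, OS_REJECT_SIGNATURE, OS_REJECT_ROLLBACK };

/* sig_ok is the outcome of signature verification performed elsewhere. */
enum os_verdict os_bundle_check(int sig_ok, uint64_t bundle_version,
                                uint64_t installed_version)
{
    if (!sig_ok)
        return OS_REJECT_SIGNATURE;
    if (bundle_version <= installed_version)
        return OS_REJECT_ROLLBACK;  /* refuse equal or older, even if validly signed */
    return OS_ACCEPT;
}

/* Raise the stored minimum only once the new system is confirmed, so a
 * failed trial boot that rolls back leaves the minimum unchanged. */
void os_min_version_on_confirm(uint64_t *min_version, uint64_t confirmed_version)
{
    assert(min_version);
    if (confirmed_version > *min_version)
        *min_version = confirmed_version;
}
```

Checking the signature first keeps the version field from being trusted before it is authenticated.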

Proposed staged plan (one submilestone at a time, build+boot+test+commit+stop)

OS-0 (this doc): audit. ← done.
OS-1 Coordinated selector: make the ESP kernel-state (already loader-read, hash-protected, atomically rewritten) the single authority that also names the base slot; updateStoreInit reads the base slot from it. One atomic flip picks both. (Design fork — see below.)
OS-2 System-update bundle format + host tool: SWSYS bundle = monotonic version + signed (IMG_SIGNING_SEED/PUB) header over {kernel image, base image} (full first; delta later). Host builder under tools/, à la sitepack.
OS-3 Kernel-mediated stage-from-/data: capability-gated syscalls that write the inactive base slot and inactive kernel slot from a verified /data staging file (resolve the kernel size-change question here), fsync before flip.
OS-4 /bin/swupdate os <url> + monotonic anti-rollback: HTTPS fetch → verify signature and version > installed → stage both slots → coordinated atomic flip → trial-boot flag + reset attempts → reboot.
OS-5 Health + auto-rollback + confirm: unify confirm (swupdate confirm), optional auto-confirm when sshd+nginx are up; rely on existing loader/kernel attempt-rollback for the no-boot case; verify the no-boot rollback end-to-end.
OS-6 make os-update-test (QEMU) + real-Hetzner runbook and Console check.
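
One possible shape for the OS-1 coordinated selector is a single sector-sized record that names both slots, so that one sector rewrite flips kernel and base together. The field layout, the use of "SWOSKSTA" as a literal magic, and the names `struct sys_selector` and `selector_flip` are all assumptions for illustration; the hash is left for the caller to recompute because SHA-256 is out of scope for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Hypothetical extended kernel-state record; not the current on-disk layout. */
struct sys_selector {
    char     magic[8];      /* assumed to hold "SWOSKSTA" */
    uint32_t seq;           /* bumped on every rewrite */
    uint8_t  kernel_slot;   /* 0 = A, 1 = B */
    uint8_t  base_slot;     /* 0 = A, 1 = B */
    uint8_t  state;         /* 0 = untried, 1 = confirmed */
    uint8_t  attempt_count;
    uint8_t  reserved[SECTOR_SIZE - 16 - 32];
    uint8_t  sha256[32];    /* integrity hash over the preceding bytes */
};

static_assert(sizeof(struct sys_selector) == SECTOR_SIZE,
              "selector must occupy exactly one sector");

/* Build the next record: kernel and base slots change in the same image. */
struct sys_selector selector_flip(const struct sys_selector *cur)
{
    struct sys_selector next = *cur;
    next.seq = cur->seq + 1;
    next.kernel_slot = cur->kernel_slot ^ 1;
    next.base_slot = next.kernel_slot;  /* kernel-A pairs with base-A, B with B */
    next.state = 0;                     /* new pair starts on trial */
    next.attempt_count = 0;
    memset(next.sha256, 0, sizeof next.sha256); /* caller recomputes before writing */
    return next;
}
```

Because both slot fields live in the same 512-byte record, there is no intermediate state in which the kernel pointer has moved and the base pointer has not.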

Open decisions (need a call before OS-1/OS-3)

Coordinated selector shape:
- (a) Single authority in ESP: extend kernel-state to carry the base slot; the loader/kernel read both from it (matches "ONE selector" literally; one atomic 512-byte rewrite the loader already does). Recommended.
- (b) Convention coupling: keep both stores, have swupdate os always stage+activate+confirm A/B in lockstep with a two-phase commit + boot recovery. Less invasive, but two pointers can still drift on a torn write.
New-kernel sizing: (a) fixed/padded kernel slots sized to a max (simple, wastes ESP space) vs (b) a real FAT32 cluster allocator (general, more code).
Bundle contents: full kernel+base every time first; add delta later — OK?
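
For the new-kernel sizing decision, option (a) can be sketched as padding every kernel image to one fixed slot size, which keeps the existing same-size in-place FAT32 copy usable. The function name `kslot_pad` and the zero-fill padding are assumptions; the actual slot size is a decision this audit leaves open, so it is a parameter here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Fill a fixed-size slot buffer with the image followed by zero padding.
 * Returns 0 on success, or -EINVAL if the image is empty or does not fit. */
int kslot_pad(uint8_t *slot, size_t slot_size,
              const uint8_t *image, size_t image_len)
{
    assert(slot && image);
    if (image_len == 0 || image_len > slot_size)
        return -EINVAL;
    memcpy(slot, image, image_len);
    memset(slot + image_len, 0, slot_size - image_len); /* deterministic tail */
    return 0;
}
```

With padded slots the signed manifest would need either to hash the whole padded file or to record the true image length alongside the hash; which of the two applies is not settled here.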
