OS self-update (kernel + base image) — Phase 0 audit
Goal: update SwiftOS itself (kernel + base image) on a running box, with no
Rescue/live-dd — deliver a signed bundle over HTTPS, stage into the inactive
A/B slot, atomically switch, health-confirm, auto-rollback on failure.
This document is the mandatory Phase-0 audit: what of the A/B skeleton is real vs stub/missing, before any new code is written.
TL;DR
The A/B mechanism is far more complete than the prompt assumes — both the kernel (ESP) and the base image (data-disk update store) already have signed slots, attempt-based rollback, confirm, and full QEMU test coverage. What is missing is the delivery + coordination + anti-rollback layer the prompt actually asks for:
- a signed system-update bundle (kernel+base, monotonic version),
- a download/stage path that works on a live box (today base staging needs a physically-attached payload disk; kernel staging can only duplicate the running kernel, never install a new one),
- one coordinated boot-selector (kernel and base are selected independently today and can drift),
- monotonic anti-rollback (never enforced),
- /bin/swupdate os <url> + unified confirm, and make os-update-test.
What is REAL (implemented, tested, in make test)
Kernel A/B — ESP side (boot/efi/loader.c, kernel/fs/esp.swift)
- UEFI loader reads a signed SWOSKERN manifest (kernelboot.swift v3 = per-slot SHA-256 + Ed25519) and kernelA.bin/kernelB.bin from \EFI\swift-os on the GPT/ESP disk; verifies the slot hash before jumping.
- Writable kernel-state record (SWOSKSTA, SHA-256-protected, not signed): per-slot attemptCount + state (untried/confirmed), seq, lastBooted, mutable active. Loader bumps the attempt counter each boot.
- Loader attempt-based rollback (KS_MAX_ATTEMPTS=3): unconfirmed active slot that exhausts its attempts → boot the fallback. (uefi_krollback_test.sh.)
- Runtime syscalls (capConsole-gated):
  - swos-kstage (68): FAT32 in-place copy active→inactive kernel slot + verify.
  - swos-kactivate (69): flip active slot in kernel-state, mark untried.
  - swos-kconfirm (70): mark booted kernel slot confirmed.
- Tests: uefi_kernel_ab, uefi_kstage, uefi_kactivate, uefi_kattempt, uefi_kconfirm, uefi_krollback, all wired into make test.
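The attempt-based rollback described in the list above can be sketched roughly as follows. This is an illustration only, not the code in boot/efi/loader.c: the struct layout, the field names, and the function ks_pick_slot are hypothetical, and the sketch assumes attempts are counted only while the active slot is unconfirmed. Only KS_MAX_ATTEMPTS=3 and the untried/confirmed states come from the audit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KS_MAX_ATTEMPTS 3   /* value taken from the audit */

enum ks_slot_state { KS_UNTRIED = 0, KS_CONFIRMED = 1 };

/* Hypothetical in-memory view of one slot in the kernel-state record. */
struct ks_slot {
    uint8_t attempt_count;
    uint8_t state;          /* enum ks_slot_state */
};

struct ks_record {
    uint8_t active;         /* 0 = slot A, 1 = slot B */
    struct ks_slot slot[2];
};

/*
 * Decide which slot to boot and account for the attempt.
 * Returns the slot index to boot (0 or 1). The caller is expected to
 * persist the updated record before jumping to the kernel.
 */
int ks_pick_slot(struct ks_record *rec)
{
    assert(rec != NULL && rec->active < 2);

    struct ks_slot *cur = &rec->slot[rec->active];

    if (cur->state == KS_CONFIRMED)
        return rec->active;             /* known-good slot: boot it */

    if (cur->attempt_count >= KS_MAX_ATTEMPTS)
        return rec->active ^ 1;         /* trial exhausted: boot the fallback */

    cur->attempt_count++;               /* record this trial boot */
    return rec->active;
}
```

With an untried slot B active, three calls return slot B and the fourth returns slot A, which is the behavior uefi_krollback exercises at the system level.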
Base-image A/B — data-disk update store (kernel/fs/swosboot.swift, updatestore.swift)
- SWOSBOOT manifest: CRC32-protected, double-buffered (LBA 0/1, torn-write safe, highest valid sequence wins). Two slots, each a full signed SWOSBASE-v3 image, plus active/fallback, per-slot state/attemptCount/generation. Trust boundary documented: manifest is not a trust anchor; it only selects among self-authenticating signed images.
- Boot: updateStoreInit selects active slot, records the boot attempt, and does attempt-based rollback (maxBootAttempts=3, U1d) to the fallback; verified-fallback at mount if the active image fails Ed25519.
- Power-fail discipline already present: stage slot fully + virtioBlkFlush before the manifest write-back, and the manifest write goes to the other double-buffer copy then flushes (U1h).
- Runtime syscalls (capConsole-gated):
  - swos-update (67): copy an attached payload disk into the inactive slot.
  - swos-activate (66): flip active base slot (on trial, untried).
  - swos-confirm (65): confirm booted base slot.
- Host builder tools/updatestore.swift; tests ab_update/persist/confirm/rollback/activate/payload/stage/flush, all in make test.
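The "highest valid sequence wins" rule for the double-buffered manifest can be illustrated with a small sketch. The struct below is a hypothetical stand-in (the real SWOSBOOT layout is not reproduced in this audit), and manifest_pick is an invented name; the CRC is the standard IEEE CRC-32. Sequence wraparound is ignored for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical manifest copy; not the real SWOSBOOT on-disk layout. */
struct manifest {
    uint32_t sequence;   /* bumped on every write-back */
    uint8_t  active;     /* selected slot */
    uint8_t  pad[3];
    uint32_t crc;        /* CRC-32 over all bytes before this field */
};

/* Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320). */
uint32_t crc32_ieee(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static int manifest_valid(const struct manifest *m)
{
    return crc32_ieee(m, offsetof(struct manifest, crc)) == m->crc;
}

/*
 * Pick between the two on-disk copies (LBA 0 and LBA 1).
 * Returns 0 or 1 for the winning copy, or -1 if neither is valid.
 */
int manifest_pick(const struct manifest copy[2])
{
    assert(copy != NULL);
    int v0 = manifest_valid(&copy[0]);
    int v1 = manifest_valid(&copy[1]);

    if (v0 && v1)
        return copy[1].sequence > copy[0].sequence ? 1 : 0;
    if (v0) return 0;
    if (v1) return 1;
    return -1;
}
```

Because a write-back always targets the copy that did not win, a torn write can only damage the newer copy, and its CRC failure makes the reader fall back to the older, still-valid one.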
Site updates (/bin/swupdate, SU-A/B/C) — reference pattern, already shipped
- SWSITE signed bundle (Ed25519 + SHA-256), fetched over TLS 1.3 by swupdate site <url>, staged into /data/www/next, atomically swapped (current→prev, next→current) with crash recovery on next boot. This is exactly the delivery/stage/atomic-swap shape we need for the OS bundle, so reuse it.
GAPS vs the prompt (this is the work to do)
- No coordinated boot-selector. Kernel slot (ESP kernel-state) and base slot (data-disk SWOSBOOT) are chosen independently; swos-kactivate and swos-activate are separate commands. Nothing guarantees kernel-A boots with base-A. Prompt requires one atomic flip selecting kernel+base together.
- No live delivery of a new OS.
  - Base staging reads from a physically-attached read-only payload disk (virtioBlkSelectPayload), which is not possible on a live Hetzner box.
  - Kernel staging (swos-kstage) only duplicates the currently-running kernel into the inactive slot; there is no path to install a new kernel binary. kernelA.bin/kernelB.bin are both built from the same $(KERNEL_BIN) and the FAT32 writer does same-size in-place copy only (ka.1 != kb.1 → EINVAL). A genuinely new kernel of a different size cannot be written in place → needs fixed/padded slot sizing or a FAT cluster (re)allocator. Decision needed.
  - /bin/swupdate has seed/apply-local/site only; there is no os subcommand.
  - No combined kernel+base system-bundle format (full or delta).
- No monotonic anti-rollback. generation/sequence exist but are not a security version; an older, validly-signed (and possibly vulnerable) image is accepted. Prompt requires a monotonic version refused if ≤ the installed one.
- No kernel-mediated "write inactive slot from /data bytes" path. The only privileged slot writes are payload-disk→slot (base) and active→inactive-copy (kernel). Neither takes bytes from a downloaded /data staging file.
- No unified confirm / auto-confirm. Confirm is split (swos-confirm + swos-kconfirm), both manual. No /bin/swupdate confirm and no auto-confirm on "services healthy (sshd+nginx up)".
- No make os-update-test and, critically, no real-hardware validation. Per this session's lesson (timer netPump passed QEMU, killed the NIC on real Ampere/KVM), every boot/driver-path change here must be checked on the live box via Console before we rely on it.
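The missing anti-rollback rule is small enough to state in code. The sketch below assumes a single 64-bit security version per bundle; the function name and parameters are hypothetical, and where the installed version is stored is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Anti-rollback gate: accept a candidate only if its security version is
 * strictly greater than the installed one. Signature validity is a
 * separate, earlier check; a valid signature alone must not be enough,
 * because an old, vulnerable image is still validly signed.
 */
bool os_version_acceptable(uint64_t installed, uint64_t candidate,
                           bool signature_ok)
{
    if (!signature_ok)
        return false;
    return candidate > installed;   /* refuse candidate <= installed */
}
```

Note that the existing generation/sequence counters cannot stand in for this version: they describe write order within the store, not which release the image is.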
Proposed staged plan (one submilestone at a time, build+boot+test+commit+stop)
- OS-0 (this doc): audit. ← done.
- OS-1 Coordinated selector: make the ESP kernel-state (already loader-read, hash-protected, atomically rewritten) the single authority that also names the base slot; updateStoreInit reads the base slot from it. One atomic flip picks both. (Design fork, see below.)
- OS-2 System-update bundle format + host tool: SWSYS bundle = monotonic version + signed (IMG_SIGNING_SEED/PUB) header over {kernel image, base image} (full first; delta later). Host builder under tools/, à la sitepack.
- OS-3 Kernel-mediated stage-from-/data: capability-gated syscalls that write the inactive base slot and inactive kernel slot from a verified /data staging file (resolve the kernel size-change question here), fsync before flip.
- OS-4 /bin/swupdate os <url> + monotonic anti-rollback: HTTPS fetch → verify signature and version > installed → stage both slots → coordinated atomic flip → trial-boot flag + reset attempts → reboot.
- OS-5 Health + auto-rollback + confirm: unify confirm (swupdate confirm), optional auto-confirm when sshd+nginx are up; rely on existing loader/kernel attempt-rollback for the no-boot case; verify the no-boot rollback end-to-end.
- OS-6 make os-update-test (QEMU) + real-Hetzner runbook and Console check.
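To make OS-2 concrete, here is one possible shape for the structural part of a SWSYS header check. Everything about the layout is an assumption made for illustration (8-byte magic, three little-endian u64 fields, 32-byte header); the format has not been designed yet, and the signature field and its verification are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SWSYS_HDR_LEN 32u   /* hypothetical: 8-byte magic + 3 x u64 */

struct swsys_hdr {
    uint64_t version;       /* monotonic security version */
    uint64_t kernel_len;    /* bytes of kernel image after the header */
    uint64_t base_len;      /* bytes of base image after the kernel */
};

/* Read a little-endian u64 without relying on host byte order. */
static uint64_t rd_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Parse and sanity-check the header of a bundle of total_len bytes.
 * Returns 0 on success, -1 on any structural problem.
 * Signature verification is out of scope here and must cover these fields.
 */
int swsys_parse(const uint8_t *buf, uint64_t total_len, struct swsys_hdr *out)
{
    assert(buf != NULL && out != NULL);
    if (total_len < SWSYS_HDR_LEN || memcmp(buf, "SWSYS\0\0\0", 8) != 0)
        return -1;

    out->version    = rd_le64(buf + 8);
    out->kernel_len = rd_le64(buf + 16);
    out->base_len   = rd_le64(buf + 24);

    uint64_t payload = total_len - SWSYS_HDR_LEN;
    if (out->kernel_len == 0 || out->base_len == 0)
        return -1;
    /* Lengths must exactly tile the payload; written to avoid overflow. */
    if (out->kernel_len > payload || out->base_len != payload - out->kernel_len)
        return -1;
    return 0;
}
```

Checking that the two lengths exactly cover the payload before anything is staged keeps a malformed bundle from reaching the slot writers in OS-3.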
Open decisions (need a call before OS-1/OS-3)
- Coordinated selector shape:
  - (a) Single authority in ESP: extend kernel-state to carry the base slot; the loader/kernel read both from it (matches "ONE selector" literally; one atomic 512-byte rewrite the loader already does). Recommended.
  - (b) Convention coupling: keep both stores, have swupdate os always stage+activate+confirm A/B in lockstep with a two-phase commit + boot recovery. Less invasive, but two pointers can still drift on a torn write.
- New-kernel sizing: (a) fixed/padded kernel slots sized to a max (simple, wastes ESP space) vs (b) a real FAT32 cluster allocator (general, more code).
- Bundle contents: full kernel+base every time first; add delta later. OK?
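Option (a) for the selector could look roughly like the sketch below: one record names both slots, and a flip produces a complete replacement record that the caller writes in a single sector write. The struct fields and sys_state_flip are hypothetical; the real kernel-state record has more fields (per-slot counters, lastBooted, a SHA-256) that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ST_UNTRIED = 0, ST_CONFIRMED = 1 };

/* Hypothetical extended kernel-state: one record names both slots. */
struct sys_state {
    uint32_t seq;            /* bumped on every rewrite */
    uint8_t  kernel_slot;    /* active kernel slot (0 = A, 1 = B) */
    uint8_t  base_slot;      /* active base slot, carried alongside */
    uint8_t  state;          /* trial state of the active pair */
    uint8_t  attempt_count;  /* boots attempted while untried */
};

/*
 * Build the next record for a trial boot of the staged pair.
 * Both slots are set to the same target, so kernel and base cannot
 * drift apart. The caller writes the result in one sector write, so a
 * reader sees either the old pair or the new pair, never a mix.
 */
struct sys_state sys_state_flip(const struct sys_state *cur)
{
    assert(cur != NULL && cur->kernel_slot < 2 && cur->base_slot < 2);

    uint8_t target = cur->kernel_slot ^ 1;   /* the currently inactive slot */

    struct sys_state next = *cur;
    next.seq           = cur->seq + 1;
    next.kernel_slot   = target;
    next.base_slot     = target;
    next.state         = ST_UNTRIED;
    next.attempt_count = 0;
    return next;
}
```

Under option (b) the same flip would be two writes to two stores, which is exactly where the torn-write drift noted above comes from.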