Per-Process Namespace Design Note
Design note — RECORD ONLY. Not a Phase-1 item. No kernel changes are proposed for now. This note records intent so a future reader can pick it up; it does not schedule work. The active plan is Phase 1 in Risk Remediation Roadmap — complete the capability/handle model, deliver SMP, move drivers toward restartable userland services. The generalization sketched here would be scheduled there later, if and when a concrete driver exists.
This note records a future generalization of SwiftOS's existing per-process filesystem confinement into true per-process namespaces — a per-process root plus a real mount table — and adopts a lexical path-boundary naming idea as a cosmetic convention. It is written against the live source so the citations can be checked; re-grep before acting on them, since the tree moves.
What we have today
SwiftOS already ships real per-process filesystem confinement, enforced in the
kernel — not a path-syntax convention layered over a shared tree. The pieces, as
they exist in kernel/vfs/vfs.swift and
userland/lib/syscall.h:
- Per-process confinement root array. private var confineNodes = [Int](repeating: 0, count: maxVFSProcesses) at vfs.swift:260. A value of 0 means unconfined (the whole namespace, the compatibility default); a non-zero value is the vnode index this process is confined to.
- SYS_CONFINE = 50, defined at userland/lib/syscall.h:59; the userland wrapper at syscall.h:236 calls __syscall3(SYS_CONFINE, (long)path, 0, 0).
- vfsConfine(path:) at vfs.swift:3771. It resolves the path, requires a directory, and is confine-only / monotonic: the new root must be a descendant of the current confinement (isDescendant(node, of: confineNodes[proc]) at vfs.swift:3779), so a confined process can never widen its own reach. It also pulls cwd inside the new root if cwd would fall outside it (vfs.swift:3781).
- Inheritance across the process lifecycle. On fork/spawn the child inherits the parent's confinement: confineNodes[slot] = confineNodes[parent] at vfs.swift:997 ("a confined parent's child stays confined"). The unconfined paths reset the slot to 0 (vfs.swift:1012, vfs.swift:1035).
- Enforcement is pervasive, not advisory. confineRootForCurrentProcess() (vfs.swift:1069) and confinedAllows(_) (vfs.swift:1163, built on isDescendant at vfs.swift:1154) feed confinement checks into vfsOpen (vfs.swift:2040, check around vfs.swift:2057–2060) and into the mutating and stat paths. This is the C3 capability described in Capabilities §6.
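The pieces above can be modelled in a few lines. The following is a minimal, self-contained sketch and not the kernel code: only confineNodes, isDescendant, and the "0 means unconfined" rule are taken from this note. The ConfinementModel type, the parentOf array, the tiny four-node tree, and the choice to treat a node as its own descendant are assumptions made for illustration. The cwd adjustment done by vfsConfine is omitted.

```swift
// Toy model of monotonic confinement. Only confineNodes, isDescendant and the
// "0 means unconfined" rule come from the note; everything else is hypothetical.
struct ConfinementModel {
    // parentOf[i] is the parent vnode of node i; 0 means "no parent".
    // Index 0 is reserved so that 0 can mean "unconfined".
    // Tree: 1 = "/", 2 = "/srv", 3 = "/srv/app", 4 = "/etc"
    let parentOf = [0, 0, 1, 2, 1]
    var confineNodes = [Int](repeating: 0, count: 4)

    /// True when `node` is `ancestor` itself or lies beneath it
    /// (treating a node as its own descendant is an assumption).
    func isDescendant(_ node: Int, of ancestor: Int) -> Bool {
        var cur = node
        while cur != 0 {
            if cur == ancestor { return true }
            cur = parentOf[cur]
        }
        return false
    }

    /// Narrow `proc` to `node`. Succeeds only when the new root lies inside
    /// the current confinement, so reach can shrink but never grow.
    @discardableResult
    mutating func confine(proc: Int, to node: Int) -> Bool {
        let current = confineNodes[proc]
        if current != 0 && !isDescendant(node, of: current) { return false }
        confineNodes[proc] = node
        return true
    }

    /// Fork/spawn: the child slot copies the parent's confinement.
    mutating func inherit(child: Int, from parent: Int) {
        confineNodes[child] = confineNodes[parent]
    }

    /// Access check a confined open/stat/mutate would consult in this model.
    func allows(proc: Int, node: Int) -> Bool {
        let root = confineNodes[proc]
        return root == 0 || isDescendant(node, of: root)
    }
}
```

In this model, confining process 0 to node 2 (/srv) makes allows(proc: 0, node: 4) false, a later confine(proc: 0, to: 1) is rejected because / is not inside /srv, and a child created with inherit(child: 1, from: 0) starts with the same root as its parent.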
What is missing for true per-process namespaces is the structure underneath:
- There is one global vnode tree. The fixed node table (private let maxNodes = 6144, nodes, nodeCount at vfs.swift:80–82) is a single shared graph. The existing "mount" machinery (buildBaseFromDisk at vfs.swift:666, mountPackageImages at vfs.swift:700, vfsMountDataFs at vfs.swift:1718) performs build-time grafts into that one tree (base image, package images, the /data datafs tier). It is not a per-process, runtime, namespace-scoped mount table: there is no (namespaceId, mountpoint) → subtree mapping and no per-process view divergence.
Comparison to prior art
A well-known convention puts the container boundary in the path itself — a
lexical syntax such as /ns::container/..., where a ::-delimited segment names
a sandbox inline in the pathname. It reads as a self-documenting way to say "this
path is rooted inside container X." In practice that syntax is usually design
intent only: the typical implementation behind it has one global archive (a
single tar), one shared open-file table, and no per-process root — the :: is
cosmetic, not an isolation mechanism.
SwiftOS's implemented confinement already exceeds a path-syntax-only design's
isolation: confineNodes is a per-process, inherited, kernel-enforced,
monotonic ceiling on what a process can name, checked on every open and mutation
— not a string convention. We therefore adopt only the naming-convention idea
from that prior art, never its implementation.
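To make the naming-convention idea concrete, a display-only formatter might look like the sketch below. The function displayPath and its parameters are hypothetical; nothing like it exists in the tree. It only produces a string for humans and tools, and nothing ever parses that string back to make an access decision.

```swift
/// Display-only helper (hypothetical): renders a path in the
/// "/ns::container/..." notation for tools such as ps or logs.
/// An unconfined process (container == nil) is shown with the plain path.
func displayPath(namespace: String, container: String?, path: String) -> String {
    guard let container = container else { return path }
    // Make sure exactly one "/" separates the container tag from the path.
    let suffix = path.hasPrefix("/") ? path : "/" + path
    return "/\(namespace)::\(container)\(suffix)"
}
```

With these assumptions, displayPath(namespace: "ns", container: "web", path: "/etc/hosts") yields "/ns::web/etc/hosts", while passing nil for the container returns "/etc/hosts" unchanged.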
The future generalization
Three pieces, sketched as a proposal — not a build plan.
- Per-process root override. Generalize confineNodes from a monotonic ceiling into a per-process root that can be rebased (chroot / pivot-root grade), so a process sees its namespace root as /. This is roughly 90% present already: the per-process slot, inheritance, and pervasive enforcement all exist. The remaining 10% is the ability to rebase / per process (to resolve / to a process-specific node and let the mount table below diverge) rather than only narrow the reachable set as vfsConfine does today.
- A real mount table keyed by (namespaceId, mountpoint). A small, fixed table mapping (namespaceId, mountpoint vnode) → subtree root, consulted during path walk, so two processes in different namespaces can see different trees at the same path. Today's grafts are global and build-time; this would be per-namespace and runtime. It must follow the allocation-free, fixed-table style the kernel already uses for nodes/confineNodes (maxNodes-sized, no heap growth on the hot path) and be touched only under vfsLock, the same discipline that protects the existing tables, so the future implementer inherits the right SMP/locking constraints from the start.
- The lexical-boundary path idea as a convention. Adopt the ::-style namespace-in-path notation purely as a userland / tooling display and diagnostic convention (e.g. how ps, logs, or a shell prompt might render a confined process's root) and not as a kernel parsing rule. The kernel never parses :: to grant or restrict access. Authority stays capability/handle-based (C1–C3 and handles remain the only authority); the syntax is cosmetic naming for humans and tools, never a security boundary. Path strings neither grant nor restrict reach.
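As an illustration of the mount-table piece, here is a minimal sketch of a fixed-capacity table keyed by (namespaceId, mountpoint vnode). Every name in it (MountEntry, MountTable, mount, cross, the capacity of 8) is hypothetical and does not exist in the tree. Locking is left out: in the kernel, every access would have to happen under vfsLock.

```swift
// Hypothetical sketch of a fixed-size, allocation-free mount table.
struct MountEntry {
    var inUse = false
    var namespaceId = 0
    var mountpoint = 0   // vnode index where the subtree is grafted
    var subtreeRoot = 0  // vnode index the path walk continues from
}

struct MountTable {
    static let capacity = 8
    // Preallocated once; mount and cross never grow it.
    var entries = [MountEntry](repeating: MountEntry(), count: MountTable.capacity)

    /// Add or replace a mapping. Returns false when the table is full.
    @discardableResult
    mutating func mount(namespaceId: Int, at mountpoint: Int, subtreeRoot: Int) -> Bool {
        var freeSlot = -1
        for i in 0..<MountTable.capacity {
            if entries[i].inUse {
                if entries[i].namespaceId == namespaceId && entries[i].mountpoint == mountpoint {
                    entries[i].subtreeRoot = subtreeRoot   // remount in place
                    return true
                }
            } else if freeSlot < 0 {
                freeSlot = i                               // remember first free slot
            }
        }
        if freeSlot < 0 { return false }
        entries[freeSlot] = MountEntry(inUse: true, namespaceId: namespaceId,
                                       mountpoint: mountpoint, subtreeRoot: subtreeRoot)
        return true
    }

    /// Consulted at each step of a path walk: if `node` is a mountpoint in
    /// this namespace, continue from the mounted subtree; otherwise stay put.
    func cross(namespaceId: Int, node: Int) -> Int {
        for e in entries where e.inUse && e.namespaceId == namespaceId && e.mountpoint == node {
            return e.subtreeRoot
        }
        return node
    }
}

var table = MountTable()
table.mount(namespaceId: 1, at: 10, subtreeRoot: 100)
table.mount(namespaceId: 2, at: 10, subtreeRoot: 200)
print(table.cross(namespaceId: 1, node: 10))  // 100
print(table.cross(namespaceId: 2, node: 10))  // 200
print(table.cross(namespaceId: 3, node: 10))  // 10 (no mount in this namespace)
```

The linear scan is deliberate for a table this small: it keeps lookup free of heap allocation, and two namespaces resolving the same mountpoint vnode get different subtree roots, which is the view divergence the global graft machinery cannot express today.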
No syscall is added by this note. For a future implementer's reference only:
the syscall table is in userland/lib/syscall.h;
the highest number currently in use is SYS_DEVICE_MMAP = 101
(syscall.h:110), so a future namespace/mount syscall would take the next free
number there and be mirrored in the kernel dispatch. Nothing is allocated or
wired now.
Relationship to existing docs
Capabilities already covers the surrounding picture: §6
documents the C3 object-scoped confinement summarized above, and the Cells
discussion (§5, around lines 497–585, with the CellId tag at line 224) sketches
the longer-horizon per-process VFS namespace + root view and resource domain. It
notes that "the namespace already lives per-process (cwd/root in
vfs.swift)" (CAPABILITIES.md:539). This note does not restate that; it points
to it and focuses specifically on the mount-table + per-process-root
generalization and the lexical-boundary convention. Any real work would be
scheduled in Risk Remediation Roadmap, not here.
Non-goals / when to revisit
- No syscalls, ABI changes, or kernel work are part of recording this.
- Revisit only when there is a concrete driver for divergent / views, e.g. multi-tenant hosting that needs two processes to see different filesystems at the same path. Until then, the existing monotonic confinement is sufficient, and adding a per-namespace mount table would be unjustified complexity.