Cohesix is an open-source high-assurance control-plane operating system built on the formally verified seL4 microkernel, designed to keep the trusted computing base intentionally small while enabling deterministic orchestration of edge GPU systems and auditable MLOps. Cohesix is "infrastructure for AGI".
At a glance
- The `/gpu/*` namespace is published by `gpu-bridge-host`.
- `worker-gpu` reads tickets and leases; it never touches device nodes or CUDA/NVML.
- `/gpu/models` and `/gpu/telemetry/schema.json` are absent until a publish completes.

Related docs
- `docs/INTERFACES.md` — canonical `/gpu/*` and `/queen/ctl` schemas.
- `docs/ROLES_AND_SCHEDULING.md` — role-to-namespace rules.
- `docs/HOST_TOOLS.md` — host bridge and publish workflows.
- `docs/SECURE9P.md` — transport invariants and bounds.

CUDA/NVML stacks are large and platform-specific. Keeping them outside the seL4 guest (whether running on QEMU or physical UEFI hardware) preserves determinism and minimises the trusted computing base (TCB). The Cohesix instance interacts with GPUs exclusively through a capability-guarded 9P namespace mirrored by host workers.
GPU workers (worker-gpu) are another worker type under the hive’s Queen, not standalone services.
- Publish channel: `/gpu/bridge/ctl`.
- Worker control: `/queen/ctl` writes.
- Bridge link state: `ONLINE` or `DEGRADED`.

Quick validation:
```sh
./bin/gpu-bridge-host --publish --tcp-host 127.0.0.1 --tcp-port 31337 --auth-token changeme
./bin/cohsh --transport tcp --tcp-host 127.0.0.1 --tcp-port 31337 --role queen <<'COH'
ls /gpu
cat /gpu/bridge/status
COH
```
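The `cat /gpu/bridge/status` read above returns a single `state=<value>` line (`idle|receiving|ok|err`, per the namespace table). A minimal std-only sketch of parsing it on the host side; the enum and function names are illustrative, not part of any Cohesix API:

```rust
// Parse the `state=<value>` line served by /gpu/bridge/status.
// BridgeState and parse_bridge_status are illustrative names only.
#[derive(Debug, PartialEq)]
enum BridgeState {
    Idle,
    Receiving,
    Ok,
    Err,
}

fn parse_bridge_status(line: &str) -> Option<BridgeState> {
    match line.trim().strip_prefix("state=")? {
        "idle" => Some(BridgeState::Idle),
        "receiving" => Some(BridgeState::Receiving),
        "ok" => Some(BridgeState::Ok),
        "err" => Some(BridgeState::Err),
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_bridge_status("state=ok"), Some(BridgeState::Ok));
}
```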
- `/gpu/models/available/<model_id>/manifest.toml` (read-only)
- `/gpu/models/active` (append-only pointer; host swaps atomically)
- `worker-gpu` reads `/gpu/models/active` and annotates telemetry with `model_id` / `lora_id` but cannot upload artefacts.
- `/gpu/models` is published into the live VM by the host GPU bridge via `/gpu/bridge/ctl`; it is absent until the publish step completes.

```mermaid
sequenceDiagram
    participant Host as gpu-bridge-host
    participant Queen as Queen/NineDoor
    participant VM as /gpu namespace
    Host->>Queen: append snapshot (begin/b64/end) to /gpu/bridge/ctl
    Queen->>VM: install /gpu/<id>/* nodes
    Queen->>VM: install /gpu/models/* + /gpu/telemetry/schema.json
    Note over VM: /gpu/models is absent until publish succeeds
```
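The "append snapshot (begin/b64/end)" step can be sketched host-side. Assumptions: the base64 payload is split into `b64:`-prefixed lines between bare `begin` and `end` lines, a 76-column chunk width is chosen arbitrarily, and the helper names are illustrative:

```rust
// Minimal base64 (RFC 4648 alphabet) so the sketch stays std-only.
const TABLE: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

fn b64_encode(data: &[u8]) -> String {
    let mut out = String::new();
    for chunk in data.chunks(3) {
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = (u32::from(b[0]) << 16) | (u32::from(b[1]) << 8) | u32::from(b[2]);
        out.push(TABLE[(n >> 18) as usize & 63] as char);
        out.push(TABLE[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { TABLE[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { TABLE[n as usize & 63] as char } else { '=' });
    }
    out
}

// Frame a snapshot as the begin / b64:<chunk> / end lines appended to
// /gpu/bridge/ctl. The 76-column width is an assumption, not a protocol rule.
fn frame_snapshot(payload: &[u8]) -> Vec<String> {
    let encoded = b64_encode(payload);
    let mut lines = vec!["begin".to_string()];
    for c in encoded.as_bytes().chunks(76) {
        lines.push(format!("b64:{}", std::str::from_utf8(c).unwrap()));
    }
    lines.push("end".to_string());
    lines
}

fn main() {
    assert_eq!(frame_snapshot(b"hi"), vec!["begin", "b64:aGk=", "end"]);
}
```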
Publishing happens by appending to `/gpu/bridge/ctl` over the TCP console (queen role); no CUDA/NVML components enter the VM profile or hardware deployment.

| Cohesix Path | Backing Action |
|---|---|
| /gpu/<id>/info | Serialize GPU metadata (name, UUID, memory, SM count, driver/runtime versions). |
| /gpu/<id>/ctl | Accept textual commands (LEASE, RELEASE, PRIORITY <n>, RESET) and return status lines mediated by the bridge host. |
| /gpu/<id>/lease | Ticket/lease file gated by host policy; worker-gpu reads to learn active allocations and writes to request renewals. Append-only JSON lines use schema gpu-lease/v1 (state=ACTIVE|RELEASED). |
| /gpu/<id>/status | Read-only view of utilisation and recent job summaries sourced from the host; append-only job lifecycle entries and gpu-breadcrumb/v1 host-run breadcrumbs are included. |
| /gpu/bridge/ctl | Append-only publish channel for GPU bridge snapshots (begin/b64:/end lines). |
| /gpu/bridge/status | Read-only publish state (state=idle|receiving|ok|err). |
| /gpu/models/* | Host-mirrored model registry (available + active). |
| /gpu/telemetry/schema.json | Telemetry schema descriptor (read-only). |
Note:
- `/gpu/models` and `/gpu/telemetry/schema.json` appear only after a host GPU bridge publish; before that, `ls /gpu/models` returns `ERR` with `invalid-path`.
- `/gpu/bridge/ctl` is single-writer; concurrent publishers must be serialized to avoid interleaved snapshot lines.

```rust
pub struct GpuLease {
    pub gpu_id: String,
    pub mem_mb: u32,
    pub streams: u8,
    pub ttl_s: u32,
    pub priority: u8,
}
```
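Lease grants surface as append-only `gpu-lease/v1` JSON lines in `/gpu/<id>/lease` (see the table above). A std-only sketch of picking out the current lease from such a log; the substring matching is a stand-in for real JSON parsing and assumes compact, quoted keys:

```rust
// Scan an append-only /gpu/<id>/lease log (gpu-lease/v1 JSON lines) for the
// current lease: the last ACTIVE line wins, and a later RELEASED line clears it.
// Naive substring matching; a real tool would parse the JSON.
fn current_lease(log: &str) -> Option<&str> {
    let mut active = None;
    for line in log.lines() {
        if line.contains("\"state\":\"ACTIVE\"") {
            active = Some(line);
        } else if line.contains("\"state\":\"RELEASED\"") {
            active = None;
        }
    }
    active
}

fn main() {
    let log = "{\"schema\":\"gpu-lease/v1\",\"state\":\"ACTIVE\",\"gpu_id\":\"GPU-0\"}";
    assert!(current_lease(log).is_some());
}
```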
`/gpu/<id>/lease` appends JSON lines with schema `gpu-lease/v1` and fields: `schema`, `state`, `gpu_id`, `worker_id`, `mem_mb`, `streams`, `ttl_s`, `priority`. The latest `state=ACTIVE` line indicates the current lease. The Queen uses `/queen/ctl` to create GPU workers and manage leases within the same hive orchestration model.

A job descriptor looks like:

```json
{
  "job": "jid-42",
  "kernel": "vadd",
  "grid": [128, 1, 1],
  "block": [256, 1, 1],
  "bytes_hash": "sha256:...",
  "inputs": ["/bundles/vadd.ptx"],
  "outputs": ["/shard/<label>/worker/<id>/result"],
  "timeout_ms": 5000,
  "payload_b64": "..."
}
```
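Before appending a descriptor like the one above, a host tool might sanity-check the required keys. A deliberately naive sketch; the field list is taken from the example and the helper name is illustrative:

```rust
// Check that a job descriptor (raw JSON text) carries the keys shown in the
// example above. Illustrative only; real validation would use a JSON parser.
fn has_required_job_fields(json: &str) -> bool {
    ["\"job\"", "\"kernel\"", "\"grid\"", "\"block\"", "\"timeout_ms\""]
        .iter()
        .all(|key| json.contains(key))
}

fn main() {
    let ok = r#"{"job":"jid-42","kernel":"vadd","grid":[128,1,1],"block":[256,1,1],"timeout_ms":5000}"#;
    assert!(has_required_job_fields(ok));
    assert!(!has_required_job_fields(r#"{"job":"jid-42"}"#));
}
```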
- When `payload_b64` is present, the bridge decodes and hashes the inline bytes.
- Exceeding `timeout_ms` triggers job cancellation; the status stream records `ERR TIMEOUT` or includes the failure in `/gpu/<id>/status`.
- Jobs emit `QUEUED`, `RUNNING`, and `OK` entries in `/gpu/<id>/status` alongside worker telemetry updates.
GPU workers do not schedule hardware directly; they receive tickets and leases from the host over Secure9P, and all scheduling policy (queueing, eviction, throttling) runs on the host side of the bridge.

- `gpu-bridge-host --mock --list` emits deterministic namespace descriptors consumed by NineDoor via `install_gpu_nodes`.
- When dev-virt QEMU runs without a host bridge, the root-task seeds mock `/gpu/<id>` entries (GPU-0/GPU-1) with `info`, `lease`, and `status` to satisfy CLI demos; `/gpu/models` and `/gpu/telemetry/schema.json` appear only after a host GPU bridge publish.
- In mock mode, `info` returns synthetic GPU entries and `job` triggers precomputed status sequences.
- All control flows through `cohsh` and Secure9P; no separate ad-hoc GPU control protocol exists inside the Cohesix instance.
- Tickets for `/gpu/*` paths are issued only to WorkerGpu roles.
- Grants are recorded in `/log/queen.log` with ticket IDs for audit.

Jetson Nano → Cohesix Worker → Queen → PEFT/LoRA Farm → Queen → Worker → Jetson Nano
This walkthrough describes a pragmatic, end-to-end LoRA optimisation loop using Cohesix as the secure control plane, while keeping CUDA, TensorRT, and training stacks outside the VM and outside the TCB.
The design assumes:
No new IPC mechanisms are introduced. Everything flows through Secure9P namespaces and files.
Where inference runs
Active model
Why this matters
During inference, the host process emits summarised telemetry, not raw data or gradients.
Typical fields:
The host GPU bridge publishes the telemetry schema at /gpu/telemetry/schema.json into the VM. Telemetry records themselves are emitted by host-side tooling and forwarded into /queen/telemetry/* or worker telemetry streams using Secure9P; no /gpu/telemetry/* record files exist inside the VM today.
Properties:
- `model_id`, `lora_id`, `device_id`, `time_window`, and `schema_version` (`gpu-telemetry/v1`).
- Schema descriptor: `/gpu/telemetry/schema.json` (read-only), schema `gpu-telemetry/v1`.
- Fields: `schema_version`, `device_id`, `model_id`, `time_window`, `token_count`, `latency_histogram`.
- Annotations: `lora_id`, `confidence`, `entropy`, `drift`, `feedback_flags`.
- Records reach `/queen/telemetry/*` and `/queen/export/lora_jobs/*` via host tools; no in-VM ML stack is introduced.

Each Jetson runs a Cohesix Worker with a role-scoped ticket.
The worker:
- exposes `/worker/<id>/telemetry`
- stamps `model_id` / `lora_id` from `/gpu/models/active` into every forwarded record
- forwards records into `/queen/telemetry/*` and `/shard/` as appropriate

Legacy aliases at `/worker/<id>/telemetry` are available only when sharding with `legacy_worker_alias = true`.
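The forwarded records follow `gpu-telemetry/v1`. A sketch of the record shape in the style of the `GpuLease` struct; the field names come from the schema description, while the types are assumptions:

```rust
// Sketch of a gpu-telemetry/v1 record. Field names mirror the schema
// description above; types and units are assumptions for illustration.
pub struct GpuTelemetry {
    pub schema_version: String,  // "gpu-telemetry/v1"
    pub device_id: String,
    pub model_id: String,
    pub time_window: (u64, u64), // assumed [start, end) in epoch seconds
    pub token_count: u64,
    pub latency_histogram: Vec<u32>,
    // Optional annotations stamped by the worker / host tooling.
    pub lora_id: Option<String>,
    pub confidence: Option<f32>,
    pub entropy: Option<f32>,
    pub drift: Option<f32>,
    pub feedback_flags: Option<Vec<String>>,
}

fn main() {
    let rec = GpuTelemetry {
        schema_version: "gpu-telemetry/v1".into(),
        device_id: "jetson-42".into(),
        model_id: "llama3-edge-v7".into(),
        time_window: (0, 60),
        token_count: 1024,
        latency_histogram: vec![0; 8],
        lora_id: None,
        confidence: None,
        entropy: None,
        drift: None,
        feedback_flags: None,
    };
    assert_eq!(rec.schema_version, "gpu-telemetry/v1");
}
```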
This step is important on Jetson:
The Worker writes telemetry into the Queen namespace via Secure9P using the OS-named ingest surface:
```
/queen/telemetry/jetson-42/
  ctl
  seg/seg-000001
  latest
```
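Segment files under `seg/` use a zero-padded counter. A small sketch of deriving the next segment name, assuming the six-digit convention shown above; rollover past 999999 is unspecified and ignored here:

```rust
// Derive the next segment name (seg-000001 -> seg-000002). The six-digit
// zero-padded convention is read off the layout above; names that don't
// match the pattern yield None.
fn next_segment(current: &str) -> Option<String> {
    let n: u64 = current.strip_prefix("seg-")?.parse().ok()?;
    Some(format!("seg-{:06}", n + 1))
}

fn main() {
    assert_eq!(next_segment("seg-000001").as_deref(), Some("seg-000002"));
}
```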
Transport characteristics:
If the link drops:
The Queen:
When criteria are met, the Queen exports a LoRA training job:
```
/queen/export/lora_jobs/job_8932/
  telemetry.cbor
  base_model.ref
  policy.toml
```
This directory is the contract boundary between Cohesix and ML tooling.
Host operators pull this bundle via coh peft export before handing it to external training.
A LoRA farm watches /queen/export/lora_jobs/.
This can be:
Cohesix does not:
It only:
The training job produces:
- `adapter.safetensors`
- `lora.json` (rank, alpha, target layers)

These are staged on the host filesystem and surfaced through the GPU model lifecycle view:
/gpu/models/available/llama3-edge-v7/manifest.toml
The manifest records:
Workers observe the active model pointer:
/gpu/models/active -> llama3-edge-v7
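The host swaps this pointer atomically. A common way to get that property on POSIX filesystems is write-to-temp-then-rename; a sketch under the assumption that the host registry stores the pointer as a small file named `active` (the paths and file format here are illustrative, not the Cohesix registry layout):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Atomically repoint `active` at a new model id: write a temp file, then
// rename over the old pointer. rename() is atomic on POSIX filesystems, so
// readers observe either the old or the new pointer, never a partial write.
fn swap_active(dir: &Path, model_id: &str) -> io::Result<()> {
    let tmp = dir.join("active.tmp");
    fs::write(&tmp, model_id)?;
    fs::rename(&tmp, dir.join("active"))
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    swap_active(&dir, "llama3-edge-v7")?;
    assert_eq!(fs::read_to_string(dir.join("active"))?, "llama3-edge-v7");
    Ok(())
}
```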
The host inference process:
Post-deployment telemetry flows immediately, closing the loop.
Cohesix provides
Cohesix deliberately does not
To deploy this at scale, only a few thin adapters are needed:
gpu-bridge-host

- publishes `/gpu/<id>/*`, `/gpu/models/*`, and `/gpu/telemetry/schema.json` via `/gpu/bridge/ctl`
- watches `/gpu/models` (host registry) for active pointer changes

host-ticket-agent

- reads `/host/tickets/spec` (`host-ticket/v1`) and executes allowlisted actions
- handles lease actions (`gpu.lease.grant|renew|release`) using the existing `/queen/ctl`, `/queen/lease/ctl`, and `/gpu/<id>/lease`
- handles PEFT actions (`peft.import|activate|rollback`) using the same host registry and `/gpu/models/*` pointers as `coh peft`
- writes `/host/tickets/status` and `/host/tickets/deadletter` for evidence replay

coh peft (host tool)

- `coh peft export` pulls `/queen/export/lora_jobs/*` into host storage
- `coh peft import` stages adapters into the registry backing `/gpu/models/available/*`

Everything else already exists in the protocol.
That’s exactly what you want for real deployment.
Future work (per BUILD_PLAN.md milestones): ticket arbitration across multiple workers, lease renewal/expiry enforcement, GPU worker lifecycle hooks, and CI coverage of the /gpu/<id>/ namespace surface.