cohesix

Cohesix is an open-source, high-assurance control-plane operating system built on the formally verified seL4 microkernel. It keeps the trusted computing base intentionally small while enabling deterministic orchestration of edge GPU systems and auditable MLOps. Cohesix is "infrastructure for AGI".


GPU Nodes — Out-of-VM Acceleration Strategy

At a glance

Related docs

1. Rationale

CUDA/NVML stacks are large and platform-specific. Keeping them outside the seL4 guest (whether running on QEMU or physical UEFI hardware) preserves determinism and minimises the trusted computing base (TCB). The Cohesix instance interacts with GPUs exclusively through a capability-guarded 9P namespace mirrored by host workers. GPU workers (worker-gpu) are another worker type under the hive’s Queen, not standalone services.

Operational dependencies (live)

Quick validation:

```sh
./bin/gpu-bridge-host --publish --tcp-host 127.0.0.1 --tcp-port 31337 --auth-token changeme
./bin/cohsh --transport tcp --tcp-host 127.0.0.1 --tcp-port 31337 --role queen <<'COH'
ls /gpu
cat /gpu/bridge/status
COH
```

2. Model Lifecycle Surfaces (Milestone 6a)

Live publish sequence

```mermaid
sequenceDiagram
  participant Host as gpu-bridge-host
  participant Queen as Queen/NineDoor
  participant VM as /gpu namespace
  Host->>Queen: append snapshot (begin/b64/end) to /gpu/bridge/ctl
  Queen->>VM: install /gpu/<id>/* nodes
  Queen->>VM: install /gpu/models/* + /gpu/telemetry/schema.json
  Note over VM: /gpu/models is absent until publish succeeds
```
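The begin/b64/end framing above can be modelled on the receiving side as a small line-oriented state machine whose states match what `/gpu/bridge/status` reports (`idle|receiving|ok|err`). This is an illustrative sketch, not the actual Queen implementation:

```rust
// Hypothetical receiver for the /gpu/bridge/ctl publish channel: each
// appended line drives a state machine mirroring /gpu/bridge/status.

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BridgeState {
    Idle,
    Receiving,
    Ok,
    Err,
}

pub struct SnapshotReceiver {
    pub state: BridgeState,
    chunks: Vec<String>, // accumulated base64 payload chunks
}

impl SnapshotReceiver {
    pub fn new() -> Self {
        Self { state: BridgeState::Idle, chunks: Vec::new() }
    }

    /// Feed one line appended to /gpu/bridge/ctl.
    pub fn feed(&mut self, line: &str) {
        match (self.state, line) {
            (BridgeState::Idle, "begin") | (BridgeState::Ok, "begin") => {
                self.chunks.clear();
                self.state = BridgeState::Receiving;
            }
            (BridgeState::Receiving, "end") => {
                // Snapshot complete; the Queen would now install /gpu nodes.
                self.state = BridgeState::Ok;
            }
            (BridgeState::Receiving, l) if l.starts_with("b64:") => {
                self.chunks.push(l["b64:".len()..].to_string());
            }
            // Any out-of-order line invalidates the publish attempt.
            _ => self.state = BridgeState::Err,
        }
    }

    /// Concatenated base64 payload, meaningful once state == Ok.
    pub fn payload_b64(&self) -> String {
        self.chunks.concat()
    }
}
```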

3. Host GPU Worker Architecture

4. Cohesix Namespace Mapping

| Cohesix Path | Backing Action |
| --- | --- |
| `/gpu/<id>/info` | Serialize GPU metadata (name, UUID, memory, SM count, driver/runtime versions). |
| `/gpu/<id>/ctl` | Accept textual commands (`LEASE`, `RELEASE`, `PRIORITY <n>`, `RESET`) and return status lines mediated by the bridge host. |
| `/gpu/<id>/lease` | Ticket/lease file gated by host policy; `worker-gpu` reads it to learn active allocations and writes to request renewals. Append-only JSON lines use schema `gpu-lease/v1` (`state=ACTIVE\|RELEASED`). |
| `/gpu/<id>/status` | Read-only view of utilisation and recent job summaries sourced from the host; includes append-only job lifecycle entries and `gpu-breadcrumb/v1` host-run breadcrumbs. |
| `/gpu/bridge/ctl` | Append-only publish channel for GPU bridge snapshots (`begin`/`b64:`/`end` lines). |
| `/gpu/bridge/status` | Read-only publish state (`state=idle\|receiving\|ok\|err`). |
| `/gpu/models/*` | Host-mirrored model registry (available + active). |
| `/gpu/telemetry/schema.json` | Telemetry schema descriptor (read-only). |
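As an illustration, the textual `/gpu/<id>/ctl` commands listed above could be parsed as follows. The command names come from the table; the exact wire grammar and error handling are assumptions:

```rust
// Hypothetical parser for /gpu/<id>/ctl command lines.
#[derive(Debug, PartialEq)]
pub enum GpuCtl {
    Lease,
    Release,
    Priority(u8),
    Reset,
}

pub fn parse_ctl(line: &str) -> Result<GpuCtl, String> {
    let mut parts = line.split_whitespace();
    match (parts.next(), parts.next()) {
        (Some("LEASE"), None) => Ok(GpuCtl::Lease),
        (Some("RELEASE"), None) => Ok(GpuCtl::Release),
        (Some("RESET"), None) => Ok(GpuCtl::Reset),
        // PRIORITY carries one numeric argument per the table.
        (Some("PRIORITY"), Some(n)) => n
            .parse::<u8>()
            .map(GpuCtl::Priority)
            .map_err(|e| format!("bad priority: {e}")),
        _ => Err(format!("unknown ctl command: {line}")),
    }
}
```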

Note:

5. Lease Model

```rust
pub struct GpuLease {
    pub gpu_id: String, // target GPU identifier
    pub mem_mb: u32,    // requested device memory in megabytes
    pub streams: u8,    // number of reserved streams
    pub ttl_s: u32,     // lease time-to-live in seconds
    pub priority: u8,   // scheduling priority
}
```
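Lease records mirrored through `/gpu/<id>/lease` use the `gpu-lease/v1` JSON-line schema noted in the namespace table. The sketch below hand-rolls one such line from the `GpuLease` fields; the exact field names and ordering on the wire are assumptions, and a real implementation would more likely use serde:

```rust
// Hypothetical formatter for one gpu-lease/v1 JSON line appended to
// /gpu/<id>/lease. Parameters mirror the GpuLease fields; the wire layout
// is assumed. Hand-rolled here to stay dependency-free.
pub fn lease_line(
    gpu_id: &str,
    mem_mb: u32,
    streams: u8,
    ttl_s: u32,
    priority: u8,
    state: &str, // "ACTIVE" or "RELEASED" per the schema note
) -> String {
    format!(
        "{{\"schema\":\"gpu-lease/v1\",\"gpu_id\":\"{gpu_id}\",\
         \"mem_mb\":{mem_mb},\"streams\":{streams},\"ttl_s\":{ttl_s},\
         \"priority\":{priority},\"state\":\"{state}\"}}"
    )
}
```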

6. Job Descriptor Schema

```json
{
  "job": "jid-42",
  "kernel": "vadd",
  "grid": [128, 1, 1],
  "block": [256, 1, 1],
  "bytes_hash": "sha256:...",
  "inputs": ["/bundles/vadd.ptx"],
  "outputs": ["/shard/<label>/worker/<id>/result"],
  "timeout_ms": 5000,
  "payload_b64": "..."
}
```
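Before dispatching a descriptor like the one above, a worker can cheaply sanity-check the launch geometry and timeout. The bounds below (1024 threads per block, 60 s timeout ceiling) are illustrative assumptions, not documented Cohesix policy:

```rust
// Minimal in-memory mirror of the job descriptor fields checked here.
pub struct JobDescriptor {
    pub job: String,
    pub kernel: String,
    pub grid: [u32; 3],
    pub block: [u32; 3],
    pub timeout_ms: u32,
}

pub fn validate(job: &JobDescriptor) -> Result<(), String> {
    // CUDA-style limit: a block may hold at most 1024 threads.
    let threads_per_block: u32 = job.block.iter().product();
    if threads_per_block == 0 || threads_per_block > 1024 {
        return Err(format!("bad block size: {threads_per_block} threads"));
    }
    if job.grid.iter().any(|&d| d == 0) {
        return Err("grid dimensions must be nonzero".into());
    }
    // Reject absent or absurd timeouts before the job reaches the GPU host.
    if job.timeout_ms == 0 || job.timeout_ms > 60_000 {
        return Err(format!("timeout out of range: {} ms", job.timeout_ms));
    }
    Ok(())
}
```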

7. Simulation Path (for CI & macOS)

8. Security Notes

LoRA Feedback Loop Walkthrough

Jetson Nano → Cohesix Worker → Queen → PEFT/LoRA Farm → Queen → Worker → Jetson Nano

This walkthrough describes a pragmatic, end-to-end LoRA optimisation loop using Cohesix as the secure control plane, while keeping CUDA, TensorRT, and training stacks outside the VM and outside the TCB.

The design assumes:

No new IPC mechanisms are introduced. Everything flows through Secure9P namespaces and files.


1. Runtime Inference on Jetson Nano (Outside Cohesix)

Where inference runs

Active model

Why this matters


2. Telemetry Generation (Host → Worker telemetry)

During inference, the host process emits summarised telemetry, not raw data or gradients.

Typical fields:

The host GPU bridge publishes the telemetry schema at /gpu/telemetry/schema.json into the VM. Telemetry records themselves are emitted by host-side tooling and forwarded into /queen/telemetry/* or worker telemetry streams using Secure9P; no /gpu/telemetry/* record files exist inside the VM today.

Properties:

Telemetry Schema (Milestone 6a)


3. Worker Collection & Thinning

Each Jetson runs a Cohesix Worker with a role-scoped ticket.

The worker:

/shard/

Legacy aliases at /worker/<id>/telemetry are available only when sharding.legacy_worker_alias = true.

This step is important on Jetson:


4. Secure9P Transport to the Queen

The Worker writes telemetry into the Queen namespace via Secure9P using the OS-named ingest surface:

```
/queen/telemetry/jetson-42/
  ctl
  seg/seg-000001
  latest
```
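A minimal sketch of the segment naming implied by the listing above: segments are numbered `seg-NNNNNN` and roll over at a size threshold, with `latest` pointing at the most recent one. The six-digit zero-padding and the rollover rule are inferred from the example path, not from a documented spec:

```rust
// Hypothetical helpers for append-only telemetry segment naming.

/// Relative path of segment `index` under the device's telemetry directory.
pub fn segment_name(index: u32) -> String {
    format!("seg/seg-{index:06}")
}

/// Index of the segment to append to next: advance once the current
/// segment has reached the size threshold, otherwise keep appending.
pub fn next_segment(current_index: u32, current_len: u64, max_len: u64) -> u32 {
    if current_len >= max_len {
        current_index + 1
    } else {
        current_index
    }
}
```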

Transport characteristics:

If the link drops:


5. Queen Aggregation & Policy Gating

The Queen:

When criteria are met, the Queen exports a LoRA training job:

```
/queen/export/lora_jobs/job_8932/
  telemetry.cbor
  base_model.ref
  policy.toml
```

This directory is the contract boundary between Cohesix and ML tooling. Host operators pull this bundle via coh peft export before handing it to external training.


6. External PEFT / LoRA Training (Out of Band)

A LoRA farm watches /queen/export/lora_jobs/.

This can be:

Cohesix does not:

It only:


7. LoRA Artifact Import (Farm → Host Registry)

The training job produces:

These are staged on the host filesystem and surfaced through the GPU model lifecycle view:

/gpu/models/available/llama3-edge-v7/manifest.toml

The manifest records:


8. Model Distribution to Workers

Workers observe the active model pointer:

/gpu/models/active -> llama3-edge-v7
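A worker reacting to the active pointer only needs to compare the advertised model id with what it currently has loaded. A sketch, with the polling or notification mechanism left out as an assumption:

```rust
// Hypothetical worker-side decision on the /gpu/models/active pointer:
// returns Some(model_id) when a load or hot-swap is needed, None otherwise.
pub fn swap_target(loaded: Option<&str>, active: &str) -> Option<String> {
    match loaded {
        Some(cur) if cur == active => None, // already running the active model
        _ => Some(active.to_string()),      // load or hot-swap to the new model
    }
}
```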


9. Jetson Hot-Swap or Restart

The host inference process:

Post-deployment telemetry flows immediately, closing the loop.


10. What Cohesix Provides (and What It Doesn’t)

Cohesix provides

Cohesix deliberately does not


11. Minimal Glue Required for Adoption

To deploy this at scale, only a few thin adapters are needed:

Host-side

Cloud-side

Everything else already exists in the protocol.


12. Bottom Line

That’s exactly what you want for real deployment.

Future work (per BUILD_PLAN.md milestones): ticket arbitration across multiple workers, lease renewal/expiry enforcement, GPU worker lifecycle hooks, and CI coverage of the /gpu/<id>/ namespace surface.