Researching verifiable execution for agentic AI

Tribunus studies how AI agents, inference engines, tools, and distributed runtimes can produce evidence instead of requiring blind trust.

Research thesis

Modern AI safety often focuses on model behavior, model evaluations, and deployment policy. Tribunus focuses on the execution layer beneath agentic systems. The question is not only "what did the model say?" but "what was the agent allowed to do, what state did it mutate, which model and backend produced the result, what evidence was emitted, and can the execution be replayed or audited?"

Research programs

1. Verifiable Inference

PhaseIR compile-time architecture, compute images, oracle validation across backends, deterministic runtime replay, numerical tolerance matrices, and structured failure evidence.

Architecture overview →
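
As a rough illustration of oracle validation with a numerical tolerance matrix, the sketch below compares a candidate backend's outputs against reference (oracle) outputs and returns structured failure evidence rather than a bare pass/fail. Everything named here is an assumption made for illustration: the `TOLERANCES` table, its placeholder thresholds, the backend labels, and `validate_against_oracle` are hypothetical and are not taken from the Tribunus codebase.

```python
import math

# Hypothetical tolerance matrix: (backend, quantization) -> (rtol, atol).
# The numbers are placeholders, not measured Tribunus thresholds.
TOLERANCES = {
    ("metal_gpu", "nf4"): (1e-2, 1e-3),
    ("cpu", "nf4"): (1e-3, 1e-4),
}


def validate_against_oracle(candidate, oracle, backend, quantization):
    """Compare candidate outputs to oracle outputs element by element.

    Returns a dict so that a failure carries evidence: which indices
    diverged, the values involved, and the tolerance that was applied.
    """
    rtol, atol = TOLERANCES[(backend, quantization)]
    result = {
        "passed": False,
        "reason": None,
        "backend": backend,
        "quantization": quantization,
        "rtol": rtol,
        "atol": atol,
        "failures": [],
    }
    if len(candidate) != len(oracle):
        result["reason"] = "length_mismatch"
        return result

    # math.isclose treats NaN as unequal to everything, so NaNs are flagged.
    result["failures"] = [
        {"index": i, "candidate": c, "oracle": o}
        for i, (c, o) in enumerate(zip(candidate, oracle))
        if not math.isclose(c, o, rel_tol=rtol, abs_tol=atol)
    ]
    if result["failures"]:
        result["reason"] = "tolerance_exceeded"
    else:
        result["passed"] = True
    return result


if __name__ == "__main__":
    oracle = [0.5, -1.25, 2.0]
    report = validate_against_oracle(
        [0.5004, -1.2502, 2.05], oracle, "metal_gpu", "nf4"
    )
    print(report["passed"], report["failures"])
```

Keeping thresholds in one table keyed by backend and quantization makes every tolerance choice visible and reviewable, which is one plausible way to notice when a threshold has been loosened just to make a kernel pass.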

2. Governed Agents

Capability-scoped tool execution, approval gates, state-machine orchestration, plugin permission boundaries, and execution receipts.

Documentation →
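
To make capability scoping, approval gates, and execution receipts concrete, here is a minimal sketch under stated assumptions. The `TOOLS` registry, the capability names (`fs.read`, `fs.write`), the `execute` function, and the hash-chained `ReceiptLog` are all hypothetical; the tools are stubs that touch nothing, and the receipt format is not the Tribunus one.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS = "0" * 64  # placeholder "previous hash" for the first receipt


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes the same.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass
class ReceiptLog:
    """Append-only log in which each receipt commits to the previous one."""

    entries: list = field(default_factory=list)

    def append(self, record: dict) -> dict:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"prev": prev, **record}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


# Hypothetical tool registry; the functions are stubs with no side effects.
TOOLS = {
    "read_file": {
        "capability": "fs.read",
        "needs_approval": False,
        "fn": lambda args: f"read {args['path']}",
    },
    "delete_file": {
        "capability": "fs.write",
        "needs_approval": True,
        "fn": lambda args: f"deleted {args['path']}",
    },
}


def execute(log, granted, tool_name, args, approved=False):
    """Run a tool only if its capability is granted and any gate is cleared.

    Every attempt, including denied ones, produces a receipt.
    """
    tool = TOOLS[tool_name]
    if tool["capability"] not in granted:
        decision = "denied_missing_capability"
    elif tool["needs_approval"] and not approved:
        decision = "blocked_pending_approval"
    else:
        decision = "executed"
    result = tool["fn"](args) if decision == "executed" else None
    return log.append(
        {"tool": tool_name, "args": args, "decision": decision, "result": result}
    )
```

Recording denied and blocked attempts alongside successful ones means the log describes what the agent tried to do, not only what it did, and the hash chain makes after-the-fact edits to any receipt detectable by `verify()`.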

3. Local-First AI Systems

Agent control planes that run entirely on developer hardware, privacy-preserving execution, offline-capable workflows, and user-controlled provider boundaries.

4. Federated Mutual-Aid Inference (Dharma)

Semi-trusted peer networks for distributed inference, quorum-verified execution receipts, and DHT-based capability advertisements.

Designed — research in progress
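
One simple reading of quorum-verified receipts is sketched below: several peers run the same request, each reports a digest of its output, and the result is accepted only if enough peers agree. This is an assumption-laden illustration, not the Dharma design. The `quorum_verify` function and its return shape are hypothetical, and the sketch assumes peers can produce bit-identical outputs, which real heterogeneous backends may not.

```python
from collections import Counter


def quorum_verify(receipts, quorum):
    """Accept an output digest only if at least `quorum` peers reported it.

    receipts: mapping of peer id -> output digest (any hashable, e.g. hex str).
    quorum:   required number of agreeing peers; must be a strict majority
              so that two different digests can never both reach quorum.
    """
    if not receipts:
        return {"accepted": None, "votes": 0, "dissenters": []}
    if quorum * 2 <= len(receipts):
        raise ValueError("quorum must be a strict majority of responding peers")

    digest, votes = Counter(receipts.values()).most_common(1)[0]
    if votes < quorum:
        # No digest reached quorum, so there is no reference to dissent from.
        return {"accepted": None, "votes": votes, "dissenters": []}

    dissenters = sorted(peer for peer, d in receipts.items() if d != digest)
    return {"accepted": digest, "votes": votes, "dissenters": dissenters}


if __name__ == "__main__":
    reports = {"peer-a": "d1", "peer-b": "d1", "peer-c": "d2"}
    print(quorum_verify(reports, quorum=2))
```

Returning the dissenting peers alongside the accepted digest gives a network something to act on, such as lowering trust in a peer that repeatedly disagrees. The cost is redundant computation, which is the overhead trade-off a real design would need to address.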

Evidence registry

Compute image architecture and ADRs — compile-time candidate generation, 6-check admission, oracle validation
Compute Kernel Evidence Corpus — machine-readable benchmark dataset
Tribunus Benchmarks — curated leaderboard
Architecture Decision Records — ADR 0034-0041

Open problems

How to validate quantized kernels across heterogeneous backends without overfitting tolerance thresholds?
How to make agent receipts useful for audit without exposing sensitive project data?
How to verify federated inference across semi-trusted peers with minimal overhead?
How to admit dynamic-shape workloads into a compile-time inference engine?
How to handle nondeterminism in GPU drivers across different architectures?
How to prove plugin authority boundaries in a desktop runtime at the OS level?

Collaborate

Hardware vendors, ML systems researchers, safety researchers, and contributors welcome.

GitHub → · Docs → · research@tribunus.dev

Model	Qwen2.5 0.5B
Layers	24
Tensors	556
Quantization	NF4
Primary backend	MLX Metal GPU
Fallback backend	Accelerate CPU
Verification	Passing — oracle validated