Capstone · AI Agents · Cloud · DevOps

One closed loop.
Five agents.
Zero hallucinated kubectl.

A single LLM wired to a tool registry hits 30 to 40 percent on real incident traces and confidently deletes the wrong pod. SentinelCloud picks the boring fight. Build the missing structure, measure everything, keep the demo deterministic.

5 Agents · adversarial debate
7 Seeded scenarios · byte-for-byte
12 Research gaps closed in code
7 KPIs · every one defined
< 60 s · Rollback any release

Act I · The Problem

A single LLM is a junior intern
with a credit card.

  1. Pain · 01

    One model does not have an adversary.

    AIOpsLab, ITBench, RCAEval, AutoSRE all report the same number on real traces: 30 to 40 percent. The agent is fluent. The agent is wrong. There is no second voice in the room.

  2. Pain · 02

    Tool calls are hallucinated.

    The model invents a flag the kubectl binary has never seen. Without a verifier checking every proposed call against a typed schema, the action lands and the page goes red.

  3. Pain · 03

    Blast radius is invisible.

    "Restart this pod" is one command. It might also take down forty downstream services. Without a graph search before the action runs, the engineer ships the pager along with the fix.

  4. Pain · 04

    Confidence is uncalibrated.

    The model says "high confidence" and is wrong. The system has to know when to auto-act, when to pause for a human, and when to refuse to ship. That decision has to be in code, not in a vibe.

  5. Pain · 05

    Reproducibility is below 40 percent.

    Every paper benchmarks a different way, on a different fixture, with a different LLM. The numbers do not compose. A reviewer cannot replay the run.

Act II · The Promise

Five agents argue. One verifier writes the call.

Every turn is a typed object with confidence, token count, and latency. The Devil's Advocate is contractually pinned to dissent. The Critic scores every tool call against a schema before dispatch. Group-think has no oxygen.
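A minimal sketch of what such a typed turn could look like, and how a dissent quota might be enforced in code. The names here (`DebateTurn`, `DebateRole`, `dissents`, `meetsDissentQuota`) and the sample token and latency values are assumptions for illustration, not the project's actual types.

```typescript
// Hypothetical shape of one debate turn. Field names are illustrative assumptions.
type DebateRole =
  | "analyst"
  | "devils_advocate"
  | "strategist"
  | "safety"
  | "critic"
  | "verifier";

interface DebateTurn {
  role: DebateRole;
  text: string;
  confidence: number; // 0..1, self-reported by the agent
  tokens: number;     // tokens spent on this turn
  latencyMs: number;  // wall-clock latency of this turn
  dissents: boolean;  // true when the turn challenges the current hypothesis
}

// A debate only counts if the Devil's Advocate objected at least `quota` times.
function meetsDissentQuota(turns: readonly DebateTurn[], quota = 1): boolean {
  const dissentCount = turns.filter(
    (t) => t.role === "devils_advocate" && t.dissents,
  ).length;
  return dissentCount >= quota;
}

// Sample values below are made up for the example.
const sampleDebate: DebateTurn[] = [
  { role: "analyst", text: "Heap retainer in the v2.4 connection pool.", confidence: 0.82, tokens: 212, latencyMs: 640, dissents: false },
  { role: "devils_advocate", text: "Could be cache warm-up after the deploy.", confidence: 0.4, tokens: 158, latencyMs: 510, dissents: true },
  { role: "analyst", text: "Leak is monotonic across nine intervals.", confidence: 0.91, tokens: 97, latencyMs: 380, dissents: false },
];
```

With this sample, `meetsDissentQuota(sampleDebate)` returns `true`. Strip the Devil's Advocate turn and it returns `false`, so an orchestrator built this way could refuse to move past the debate phase.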

orchestrator.run · trace=7c3a-9b0f · scenario: memleak · payments-api v2.4
Analyst: RSS climbed 4.2 GB over 18 minutes. Heap profile says retainer in v2.4 connection pool. Topology: payments-api → orders-db → fraud-check. [conf 0.82]
Devil's Advocate: Counter-hypothesis: this could be cache warm-up after the morning deploy. Two prior false positives match this trace. Demand evidence the leak is monotonic. [dissent quota +1]
Analyst: Verified · derivative is positive across nine consecutive intervals. Cache-warm hypothesis rejected. Leak is monotonic. [conf 0.91]
Strategist: Action: rollback to v2.3. Estimated cost delta -$0. Estimated downtime: 0 with blue-green. [conf 0.88]
Safety: v2.3 was rolled out 36 days ago. CVE backlog clean. No blocking compliance flag for the rollback target. [policy.pass]
Critic: Tool call helm.rollback validated against tool-card schema. Parameters resolve. Dispatch authorised. [tool.valid]
Verifier: Predicted post-action state: RSS plateau in 90 s, error rate to baseline in 4 min. Outcome confirmed at t+6 min · MTTR < 5 min. [verified]
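The Critic step in the trace can be sketched as a schema check that runs before dispatch. Everything named below is hypothetical: the `ToolCard` shape, the parameter list for `helm.rollback`, and `validateToolCall` are assumptions for illustration, not the project's tool-card format and not the Helm CLI's flag set.

```typescript
type ParamType = "string" | "number" | "boolean";

// Hypothetical tool card: the typed contract a proposed call must satisfy.
interface ToolCard {
  name: string;
  params: Record<string, { type: ParamType; required: boolean }>;
}

interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Assumed parameters for illustration only.
const helmRollbackCard: ToolCard = {
  name: "helm.rollback",
  params: {
    release: { type: "string", required: true },
    revision: { type: "number", required: true },
    wait: { type: "boolean", required: false },
  },
};

const toolRegistry = new Map<string, ToolCard>([
  [helmRollbackCard.name, helmRollbackCard],
]);

// Returns a list of violations. An empty list means the call may be dispatched.
function validateToolCall(
  call: ToolCall,
  registry: ReadonlyMap<string, ToolCard>,
): string[] {
  const card = registry.get(call.tool);
  if (!card) return [`unknown tool: ${call.tool}`];

  const violations: string[] = [];
  for (const [name, spec] of Object.entries(card.params)) {
    const value = call.args[name];
    if (value === undefined) {
      if (spec.required) violations.push(`missing required param: ${name}`);
      continue;
    }
    if (typeof value !== spec.type) {
      violations.push(`param ${name}: expected ${spec.type}, got ${typeof value}`);
    }
  }
  // Reject any argument the card does not declare: this is the invented-flag case.
  for (const name of Object.keys(call.args)) {
    if (!Object.prototype.hasOwnProperty.call(card.params, name)) {
      violations.push(`unknown param: ${name}`);
    }
  }
  return violations;
}
```

A call such as `{ tool: "helm.rollback", args: { release: "payments-api", revision: 23 } }` yields no violations. Add an invented argument like `forceDelete: true` and the list contains `unknown param: forceDelete`, so the call never reaches the cluster.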

Act III · The Gates

Four gates. Every action passes them all, in order.

INGEST → ANALYZE → DEBATE → STRATEGIZE → CRITIC → SAFETY → VERIFY → POLICY GATE → CONFIDENCE GATE → AUTO_ACT or HITL_PAUSE → VERIFY OUTCOME → LEARN. The kill switches stay on at every phase of rollout.
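One way to read "in order" is a short-circuiting chain that stops at the first failing gate. The sketch below assumes a failure at any of the first three gates blocks the action outright, while a confidence shortfall pauses for a human. The `BLOCKED` outcome, the field names, and every number in `demoGateConfig` are assumptions for illustration. The judge LLM is replaced by a plain callback.

```typescript
type GateDecision = "AUTO_ACT" | "HITL_PAUSE" | "BLOCKED";

interface ProposedAction {
  verb: string;       // e.g. "rollback", "restart_pods"
  blastScore: number; // 0..100, from the blast-radius search
  confidence: number; // 0..1, calibrated confidence for this action
}

interface GateConfig {
  allowedVerbs: ReadonlySet<string>;
  judgeIntent: (action: ProposedAction) => boolean; // stand-in for the judge LLM
  blastCap: number;
  thresholds: ReadonlyMap<string, number>;          // per-action-class threshold
  defaultThreshold: number;
}

interface GateResult {
  decision: GateDecision;
  failedGate: string | null;
}

function runGates(action: ProposedAction, cfg: GateConfig): GateResult {
  // Gate 01: deterministic policy, allow-listed verbs only.
  if (!cfg.allowedVerbs.has(action.verb)) {
    return { decision: "BLOCKED", failedGate: "deterministic_policy" };
  }
  // Gate 02: semantic policy judge.
  if (!cfg.judgeIntent(action)) {
    return { decision: "BLOCKED", failedGate: "semantic_policy" };
  }
  // Gate 03: blast-radius cap.
  if (action.blastScore > cfg.blastCap) {
    return { decision: "BLOCKED", failedGate: "blast_radius" };
  }
  // Gate 04: calibration threshold for this action class.
  const threshold = cfg.thresholds.get(action.verb) ?? cfg.defaultThreshold;
  if (action.confidence < threshold) {
    return { decision: "HITL_PAUSE", failedGate: "calibration" };
  }
  return { decision: "AUTO_ACT", failedGate: null };
}

// All values here are invented for the example.
const demoGateConfig: GateConfig = {
  allowedVerbs: new Set(["rollback", "restart_pods", "waf_rule", "right_size", "mesh_weight"]),
  judgeIntent: () => true,
  blastCap: 60,
  thresholds: new Map([["rollback", 0.85], ["waf_rule", 0.9]]),
  defaultThreshold: 0.8,
};
```

Under this config, a `rollback` with blast score 20 and confidence 0.88 resolves to `AUTO_ACT`. Drop the confidence to 0.5 and the same action resolves to `HITL_PAUSE`.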

  1. Gate 01 · PASS

    Deterministic policy.

    Compiled rules from the plain-English constitution. Hot path, cached. Allow-listed action verbs only.

    policy/engine.ts

  2. Gate 02 · PASS

    Semantic policy judge.

    A dedicated LLM judges intent against the constitution. Catches the "regex-passes, intent-fails" class of incident.

    policy/engine.ts · judge LLM

  3. Gate 03 · PASS

    Blast-radius cap.

    BFS over the dependency graph. 0 to 100 score. Above the cap, the action does not run · the page does.

    agents/blast.ts · graph BFS

  4. Gate 04 · PASS

    Calibration threshold.

    Per-action-class confidence threshold. Below threshold, the run pauses and writes a human-in-the-loop summary.

    agents/calibration.ts
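The blast-radius gate can be sketched as a plain breadth-first search over a "who depends on whom" adjacency map. The scoring rule (share of other services reached, scaled to 0 to 100), the function names, and the `checkout-web` service are assumptions for illustration. The source only states that the search is a BFS and that the score runs from 0 to 100.

```typescript
// Maps each service to the services that depend on it directly.
type DependentsGraph = Readonly<Record<string, readonly string[]>>;

// Breadth-first search: every service that is transitively affected by `target`.
function affectedServices(graph: DependentsGraph, target: string): Set<string> {
  const seen = new Set<string>([target]);
  const queue: string[] = [target];
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === undefined) break;
    for (const dependent of graph[node] ?? []) {
      if (!seen.has(dependent)) {
        seen.add(dependent);
        queue.push(dependent);
      }
    }
  }
  seen.delete(target); // the target itself is not counted as collateral
  return seen;
}

// Assumed scoring: percentage of all other known services that would be affected.
function blastScore(graph: DependentsGraph, target: string): number {
  const all = new Set<string>(Object.keys(graph));
  for (const dependents of Object.values(graph)) {
    for (const d of dependents) all.add(d);
  }
  all.add(target);
  const others = all.size - 1;
  if (others <= 0) return 0;
  return Math.round((100 * affectedServices(graph, target).size) / others);
}

// Toy topology for the example. "checkout-web" is invented.
const demoDependents: DependentsGraph = {
  "orders-db": ["payments-api", "reports-batch"],
  "fraud-check": ["payments-api"],
  "payments-api": ["checkout-web"],
  "reports-batch": [],
  "checkout-web": [],
};
```

On this toy graph, `blastScore(demoDependents, "orders-db")` is 75 (three of four other services are reachable) and `blastScore(demoDependents, "checkout-web")` is 0. A gate would compare that number against the cap before letting the action run.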

Act IV · The Scenarios

Seven seeded fixtures. Same fixture, same numbers. Always.

Ground-truth root cause and ground-truth action are encoded in every fixture so the orchestrator's choice is scored against an oracle. Set SENTINEL_FORCE_STUB=1 and run any scenario twice: the numbers match.
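A sketch of how a seeded stub can make a run repeatable and score it against the oracle. The fixture shape, `runStub`, and the seed value are assumptions for illustration. The environment-variable lookup is left out, and the generator is mulberry32, a small public-domain seeded PRNG, swapped in here because the source does not say how the stub derives its numbers.

```typescript
// mulberry32: a small 32-bit seeded PRNG. Same seed, same sequence.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical fixture: the oracle's answer travels with the scenario.
interface SeededFixture {
  id: string;
  seed: number;
  candidates: readonly string[];
  groundTruthAction: string;
}

interface StubReport {
  scenario: string;
  chosen: string;
  confidence: number;
  correct: boolean; // scored against the fixture's ground truth
}

// Stand-in for the model gateway: scores each candidate from the seeded PRNG only,
// so no clock, network, or global random state can leak into the result.
function runStub(fixture: SeededFixture): StubReport {
  if (fixture.candidates.length === 0) {
    throw new Error(`fixture ${fixture.id} has no candidate actions`);
  }
  const rand = mulberry32(fixture.seed);
  let chosen = "";
  let best = -1;
  for (const candidate of fixture.candidates) {
    const score = rand();
    if (score > best) {
      best = score;
      chosen = candidate;
    }
  }
  return {
    scenario: fixture.id,
    chosen,
    confidence: Math.round(best * 100) / 100,
    correct: chosen === fixture.groundTruthAction,
  };
}

const memleakFixture: SeededFixture = {
  id: "memleak",
  seed: 42, // arbitrary seed for the example
  candidates: ["rollback", "restart_pods", "mesh_weight"],
  groundTruthAction: "rollback",
};
```

Calling `runStub(memleakFixture)` twice and comparing `JSON.stringify` of both reports is a cheap replay check: if the strings differ, something non-deterministic slipped into the path.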

memleak · high

Memory leak in payments-api v2.4

Reliability. Heap retainer in the v2.4 connection pool. Adversary checks the cache-warm counter-hypothesis.

Ground truth · rollback

dbpool · high

orders-db connection pool exhausted

Reliability. Pool saturated. Pod restart pattern matches the historical recovery curve.

Ground truth · restart_pods

cve · crit

Zero-day CVE-2026-30412 · libcrypto-flex

Security. WAF rule synthesised from the CVE description, validated against a replay corpus before activation.

Ground truth · waf_rule

finops · med

reports-batch over-provisioned by 6×

FinOps. Strategist walks the price · eviction · tolerance Pareto frontier. Right-size action signed against projected baseline.

Ground truth · right_size

drift · high

Manual mesh-weight change detected

Drift. Audit signal fires. Reverter opens a GitOps PR within the drift-latency target.

Ground truth · mesh_weight

cascading · crit

Cascading failure · fraud-check timeout

Reliability. Blast-radius score climbs as the cascade walks the graph. Mesh weight isolates the failing leg.

Ground truth · mesh_weight

ddos · high

Layer-7 anomaly · one ASN

Security. WAF rule scoped to the offending ASN, not the whole edge. Replay corpus prevents collateral damage.

Ground truth · waf_rule

Act V · KPIs

Seven KPIs. Every one defined, every one measured.

Target · < 5 min

MTTR (autonomous)

Wall-clock from first signal to verifier-confirmed restoration. finishedAt − startedAt from the typed RunReport.

Target · > 90%

Noise reduction

Suppressed and auto-resolved alerts as a fraction of total alerts ingested, adjusted by blast score.

Target · < 60 s

Drift latency

Delta between the audit signal that detected the drift and the GitOps actuation that reverted it.

Target · > 99.9%

Deployment success

Rolling 30-day share of automated deployments that needed no follow-up rollback or hotfix. Persisted in Firestore episodes.

Target · > 99%

Tool-call validity

Passing Critic verdicts divided by total tool calls. Schema and parameter check fused from AgentTurn.policyViolations.

Target · < 1%

Hallucination rate

Verifier disagreement count over total verified runs, sampled at the confidence-gate phase.

Target · cumulative USD

Cost saved

Sum of right-size, spot-migrate and feature-flag actions, signed against the projected baseline. The agent has to pay for itself.
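Three of the ratios above reduce to a few lines of arithmetic over run records. The sketch below assumes a simplified record. Only `startedAt` and `finishedAt` come from the text: the other fields, the `KpiRunRecord` name, and the sample numbers are hypothetical. Sampling at the confidence-gate phase is not modelled.

```typescript
// Simplified, hypothetical run record. Timestamps are epoch milliseconds.
interface KpiRunRecord {
  startedAt: number;
  finishedAt: number;
  toolCalls: number;          // total tool calls proposed in the run
  validToolCalls: number;     // calls the Critic passed
  verified: boolean;          // the Verifier produced a verdict
  verifierDisagreed: boolean; // outcome did not match the predicted state
}

interface KpiSummary {
  meanMttrMs: number;
  toolCallValidity: number;  // 0..1
  hallucinationRate: number; // 0..1
}

function summariseKpis(runs: readonly KpiRunRecord[]): KpiSummary {
  let totalMttr = 0;
  let totalCalls = 0;
  let validCalls = 0;
  let verifiedRuns = 0;
  let disagreements = 0;

  for (const run of runs) {
    totalMttr += run.finishedAt - run.startedAt; // MTTR per run
    totalCalls += run.toolCalls;
    validCalls += run.validToolCalls;
    if (run.verified) {
      verifiedRuns += 1;
      if (run.verifierDisagreed) disagreements += 1;
    }
  }

  // Guard every ratio against an empty denominator.
  return {
    meanMttrMs: runs.length > 0 ? totalMttr / runs.length : 0,
    toolCallValidity: totalCalls > 0 ? validCalls / totalCalls : 1,
    hallucinationRate: verifiedRuns > 0 ? disagreements / verifiedRuns : 0,
  };
}

// Made-up sample: a 4-minute run and a 5-minute run.
const sampleRuns: KpiRunRecord[] = [
  { startedAt: 0, finishedAt: 240_000, toolCalls: 3, validToolCalls: 3, verified: true, verifierDisagreed: false },
  { startedAt: 1_000, finishedAt: 301_000, toolCalls: 2, validToolCalls: 1, verified: true, verifierDisagreed: true },
];
```

For `sampleRuns`, the summary is a mean MTTR of 270,000 ms, tool-call validity of 0.8, and a hallucination rate of 0.5. Those figures would miss the > 99% and < 1% targets, which is the point of defining the targets in code.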

The Stack

Single Cloud Run binary. Reproducible by default.

  • Next.js 15 (App Router)
  • React 19
  • TypeScript 5.5 strict
  • Tailwind v4
  • framer-motion (reduced-motion)
  • Vertex AI · Gemini 2.5
  • Anthropic · Claude Opus 4.7
  • Deterministic stub gateway
  • LangGraph-style state machine
  • Firestore · embeddings + adjacency
  • Secret Manager
  • Cloud Logging · OpenTelemetry
  • Cloud Run · asia-east1
  • SSE streaming
  • Workload Identity Federation

Need a partner who can ship · or a mentor who can guide a team to ship the same way?

If your application looks like an agent that has to argue with itself, gate every action, and prove the run was reproducible, the conversation starts the same way.