memleakhigh
Memory leak in payments-api v2.4
Reliability. Heap retainer in the v2.4 connection pool. Adversary checks the cache-warm counter-hypothesis.
Ground truth · rollback
Work / AI · Agents / SentinelCloud
Capstone · AI Agents · Cloud · DevOpsA single LLM wired to a tool registry hits 30 to 40 percent on real incident traces and confidently deletes the wrong pod. SentinelCloud picks the boring fight. Build the missing structure, measure everything, keep the demo deterministic.
Act I · The Problem
AIOpsLab, ITBench, RCAEval, AutoSRE all report the same number on real traces: 30 to 40 percent. The agent is fluent. The agent is wrong. There is no second voice in the room.
The model invents a flag the kubectl binary has never seen. Without a verifier checking every proposed call against a typed schema, the action lands and the page goes red.
"Restart this pod" is one command. It might also take down forty downstream services. Without a graph search before the action runs, the engineer ships the pager along with the fix.
The model says "high confidence" and is wrong. The system has to know when to auto-act, when to pause for a human, and when to refuse to ship. That decision has to be in code, not in a vibe.
Every paper benchmarks a different way, on a different fixture, with a different LLM. The numbers do not compose. A reviewer cannot replay the run.
Act II · The Promise
Every turn is a typed object with confidence, token count, and latency. The Devil's Advocate is contractually pinned to dissent. The Critic scores every tool call against a schema before dispatch. Group-think has no oxygen.
Act III · The Gates
INGEST → ANALYZE → DEBATE → STRATEGIZE → CRITIC → SAFETY → VERIFY → POLICY GATE → CONFIDENCE GATE → AUTO_ACT or HITL_PAUSE → VERIFY OUTCOME → LEARN. The kill switches stay on at every phase of rollout.
Gate 01 PASS
Compiled rules from the plain-English constitution. Hot path, cached. Allow-listed action verbs only.
Gate 02 PASS
A dedicated LLM judges intent against the constitution. Catches the "regex-passes, intent-fails" class of incident.
Gate 03 PASS
BFS over the dependency graph. 0 to 100 score. Above the cap, the action does not run · the page does.
Gate 04 PASS
Per-action-class confidence threshold. Below threshold, the run pauses and writes a human-on-the-loop summary.
Act IV · The Scenarios
Ground-truth root cause and ground-truth action are encoded in every fixture so the orchestrator's choice is scored against an oracle. Set SENTINEL_FORCE_STUB=1, run any scenario twice, the numbers match.
memleakhigh
Reliability. Heap retainer in the v2.4 connection pool. Adversary checks the cache-warm counter-hypothesis.
Ground truth · rollback
dbpoolhigh
Reliability. Pool saturated. Pod restart pattern matches the historical recovery curve.
Ground truth · restart_pods
cvecrit
Security. WAF rule synthesised from the CVE description, validated against a replay corpus before activation.
Ground truth · waf_rule
finopsmed
FinOps. Strategist walks the price · eviction · tolerance Pareto frontier. Right-size action signed against projected baseline.
Ground truth · right_size
drifthigh
Drift. Audit signal fires. Reverter opens a GitOps PR within the drift-latency target.
Ground truth · mesh_weight
cascadingcrit
Reliability. Blast-radius score climbs as the cascade walks the graph. Mesh weight isolates the failing leg.
Ground truth · mesh_weight
ddoshigh
Security. WAF rule scoped to the offending ASN, not the whole edge. Replay corpus prevents collateral damage.
Ground truth · waf_rule
Act V · KPIs
Target · < 5 min
Wall-clock from first signal to verifier-confirmed restoration. finishedAt − startedAt from the typed RunReport.
Target · > 90%
Suppressed and auto-resolved alerts as a fraction of total alerts ingested, adjusted by blast score.
Target · < 60 s
Delta between the audit signal that detected the drift and the GitOps actuation that reverted it.
Target · > 99.9%
Rolling 30-day count of automated deployments without a follow-up rollback or hotfix. Persisted in Firestore episodes.
Target · > 99%
Critic verdicts per turn divided by total tool calls. Schema and parameter check fused from AgentTurn.policyViolations.
Target · < 1%
Verifier disagreement count over total verified runs, sampled at the confidence-gate phase.
Target · cumulative USD
Sum of right-size, spot-migrate and feature-flag actions, signed against the projected baseline. The agent has to pay for itself.
The Stack
If your application looks like an agent that has to argue with itself, gate every action, and prove the run was reproducible, the conversation starts the same way.