SmartEvaluator-Omni
No. 19 · AI · Multi-agent grading swarm
Four AI judges argue.
One fair mark.
One examiner is tired, biased, slow and outnumbered. Four specialised AI judges aren't. They read the same answer in parallel, lay out their reasoning, weigh each other's verdicts, and converge on a mark a student can dispute and a teacher can defend.
Case Q-0431 · OS · Semaphores · 10 marks
"Explain how a binary semaphore prevents race conditions in the producer-consumer problem."
Judge 01 · Gemini
Definition is correct. Cited atomic P() and V(). Missed the bounded-buffer wraparound case. Strong on theory, light on application.
Vote · 7.5 / 10
Judge 02 · Llama 3
Pseudocode is syntactically valid. Init values stated. Critical section is correctly bracketed. Would award full 4 of 4 for code.
Vote · 8.0 / 10
Judge 03 · Claude
Concur on theory. Disagree with Judge 01 on the missed case: the student does cover it implicitly by mentioning "modulo N". Half-mark restored.
Vote · 8.0 / 10
Judge 04 · BERT · Plagiarism
Cosine similarity 0.18 against the corpus. No verbatim copy detected. No veto raised.
Pass · No veto
Consensus engine
Weighted average across the judges (weights 0.30 / 0.25 / 0.30 / 0.15, with the plagiarism judge holding a veto). The consensus mark is recorded with the full deliberation log.
Final · 7.83 / 10
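A minimal sketch of how such a consensus step could work, not the production engine. The source does not say how the plagiarism judge's 0.15 weight enters the average when that judge only passes or vetoes, so this sketch assumes the weights are renormalised over the judges that return a score. It also assumes a veto sets the mark to zero pending review. `Verdict` and `consensus` are hypothetical names. Under these assumptions the Q-0431 panel comes out near 7.82, so the real engine evidently treats the fourth weight a little differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    judge: str
    score: Optional[float]  # None for a judge that only passes or vetoes
    weight: float
    veto: bool = False


def consensus(verdicts: list[Verdict]) -> float:
    """Weighted mean of the scoring judges; any veto cancels the panel verdict."""
    if any(v.veto for v in verdicts):
        return 0.0  # assumption: a veto zeroes the mark pending human review
    scoring = [v for v in verdicts if v.score is not None]
    total_weight = sum(v.weight for v in scoring)
    if total_weight == 0:
        raise ValueError("no scoring judge on the panel")
    # Assumption: renormalise over the judges that actually produced a score.
    return sum(v.score * v.weight for v in scoring) / total_weight


# Votes and weights taken from case Q-0431.
panel = [
    Verdict("Judge 01 · Gemini", 7.5, 0.30),
    Verdict("Judge 02 · Llama 3", 8.0, 0.25),
    Verdict("Judge 03 · Claude", 8.0, 0.30),
    Verdict("Judge 04 · BERT", None, 0.15),  # pass, no veto
]
mark = consensus(panel)
```

Keeping each `Verdict` alongside the final mark is what would make the result auditable: the log holds every judge's score and weight, so a disputed mark can be recomputed by hand.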
Final mark · explainable · auditable
Awarded 7.83 / 10 with deliberation log attached.
4 · Specialised LLM judges
1 · Veto vote on plagiarism
2 · Inference modes (cloud + local)
∞ · Audit log entries per question
0 · Single-judge bias
Act I · The Problem
A tired examiner
at 2 a.m.
Examiners grade hundreds of papers a week. Marks reflect mood, fatigue and the order in which papers are read, not just the answer. Students can't see the reasoning. Teachers can't defend the score.
i.
Single-rater bias.
One examiner equals one judgment call. Studies repeatedly show raters disagreeing by as much as two full points on the same essay.
ii.
No audit trail.
The student receives a number. There is no log of the reasoning behind it. Disputes go nowhere.
iii.
Plagiarism caught too late.
Manual checks happen after the mark is awarded. By then the verdict is on the report card.
iv.
One model is one bias.
Use only Gemini and you inherit Gemini's blind spots. Use only Claude and you inherit Claude's. The fix is plurality.
Act II · The Bench
The four judges.
One specialty each.
Each model is asked the question it is best at. Their verdicts are weighted. The plagiarism judge can veto outright.
Judge 01
Gemini Pro
The theorist.
Reads the answer for conceptual accuracy. Strongest on definition, derivation, citation. Cloud inference via Google API.
Judge 02
Llama 3 (Ollama)
The code reader.
Local inference. Reads pseudocode and source for syntactic and structural correctness. No data leaves the building.
Judge 03
Mistral or Claude
The dissenter.
Reads the other judges' verdicts and argues against them when warranted. Restores marks the panel missed.
Judge 04
BERT (similarity)
The veto.
Computes cosine similarity against the corpus. Plagiarism above threshold cancels the panel verdict.
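The veto check can be sketched as follows. This is an illustration under stated assumptions: the bag-of-words `embed` function is a toy stand-in for a BERT sentence embedding, and `VETO_THRESHOLD` is a hypothetical value, since the source does not state the real threshold.

```python
import math
from collections import Counter

VETO_THRESHOLD = 0.80  # hypothetical; the actual threshold is not given in the source


def embed(text: str) -> Counter:
    """Toy bag-of-words vector, standing in for a BERT sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def plagiarism_check(answer: str, corpus: list[str]) -> tuple[float, bool]:
    """Return (highest similarity against the corpus, veto flag)."""
    vec = embed(answer)
    highest = max((cosine(vec, embed(doc)) for doc in corpus), default=0.0)
    return highest, highest >= VETO_THRESHOLD
```

Taking the maximum over the corpus, not the mean, means a single near-copy is enough to raise the veto.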
Act III · The Digital Twin
Grade in your teacher's voice.
A Vector RAG persona engine learns each teacher's style. Strict on units? Generous on partial credit? The swarm grades the way that teacher would, every time.
The Persona RAG engine.
Past corrected answers feed ChromaDB. Every new submission is graded by the swarm with that teacher's vector context injected. Two teachers get two different fair marks for the same paper, and both can defend their own.
"Award partial credit for a correctly stated principle even if the worked example is missing."
Dr Mehta's grading style, learned over 312 prior assessments.
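The retrieve-then-inject flow can be sketched in plain Python. This is an illustration only: the in-memory list stands in for the ChromaDB collection, Jaccard token overlap stands in for vector similarity, and `Correction`, `persona_context` and `build_prompt` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Correction:
    teacher: str
    answer: str    # a past student answer
    feedback: str  # what the teacher wrote when correcting it


def overlap(a: str, b: str) -> float:
    """Jaccard token overlap; a toy stand-in for vector similarity search."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def persona_context(store: list[Correction], teacher: str,
                    submission: str, k: int = 2) -> list[str]:
    """Return the k pieces of this teacher's feedback most relevant to the submission."""
    own = [c for c in store if c.teacher == teacher]
    ranked = sorted(own, key=lambda c: overlap(c.answer, submission), reverse=True)
    return [c.feedback for c in ranked[:k]]


def build_prompt(store: list[Correction], teacher: str, submission: str) -> str:
    """Inject the retrieved feedback into the grading prompt given to each judge."""
    examples = "\n".join(f"- {f}" for f in persona_context(store, teacher, submission))
    return (f"Grade in the style of {teacher}.\n"
            f"Past feedback from this teacher:\n{examples}\n"
            f"Answer to grade:\n{submission}")
```

Filtering by teacher before ranking is what lets the same paper receive two different, individually defensible marks: each prompt carries only that teacher's own precedents.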
Act IV · Hybrid Infra
Cloud APIs and local Ollama.
Switch on a per-question basis.
A router decides which judge runs in the cloud and which runs on-prem. Sensitive student data never has to leave the institution.
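One way such a router might look, as a sketch only. The default placement for the theorist (cloud) and the code reader (local) follows the judge descriptions above. Placing the dissenter in the cloud and the similarity judge on-prem is an assumption, as is the rule that a sensitive question forces every judge local. `Mode`, `route` and `plan` are hypothetical names.

```python
from enum import Enum


class Mode(Enum):
    CLOUD = "cloud"
    LOCAL = "local"


# Default placement per judge; the last two entries are assumptions.
DEFAULT_MODE = {
    "theorist": Mode.CLOUD,     # Gemini Pro via cloud API
    "code_reader": Mode.LOCAL,  # Llama 3 via Ollama
    "dissenter": Mode.CLOUD,    # assumed: Claude in the cloud, Mistral if forced local
    "veto": Mode.LOCAL,         # assumed: BERT similarity runs on-prem
}


def route(judge: str, sensitive: bool) -> Mode:
    """Pick where one judge runs for one question."""
    if sensitive:
        return Mode.LOCAL  # assumption: sensitive data never leaves the institution
    return DEFAULT_MODE[judge]


def plan(sensitive: bool) -> dict[str, Mode]:
    """Routing plan for the whole panel on a single question."""
    return {judge: route(judge, sensitive) for judge in DEFAULT_MODE}
```

Calling `plan(sensitive=True)` keeps all four judges on-prem, while `plan(sensitive=False)` falls back to each judge's default placement.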
- Python 3.10+
- FastAPI (async)
- LangChain · CrewAI
- ChromaDB
- Pinecone (cloud)
- Ollama · Llama 3 local
- Google Gemini Pro
- Anthropic Claude
- OpenAI GPT-4
- BERT similarity
- Weighted consensus matrix
- Audit trail per question
If a mark must be defensible, one model is not enough.
I architect multi-agent grading systems with auditable consensus, persona learning and on-prem fallback. Built for institutions where every mark might land in court.