SmartEvaluator-Omni

No. 19 · AI · Multi-agent grading swarm

Four AI judges argue.
One fair mark.

One examiner is tired, biased, slow and outnumbered. Four specialised AI judges aren't. They read the same answer in parallel, lay out their reasoning, weigh each other's verdicts, and converge on a mark a student can dispute and a teacher can defend.


Case Q-0431 · OS · Semaphores · 10 marks "Explain how a binary semaphore prevents race conditions in producer-consumer."

Judge 01 · Gemini

Definition is correct. Cited atomic P() and V(). Missed the bounded-buffer wraparound case. Strong on theory, light on application.

Vote · 7.5 / 10

Judge 02 · Llama 3

Pseudocode is syntactically valid. Init values stated. Critical section is correctly bracketed. Would award full 4 of 4 for code.

Vote · 8.0 / 10

Judge 03 · Claude

Concur on theory. Disagree with Judge 01 on the missed case: the student does cover it implicitly by mentioning "modulo N". Half-mark restored.

Vote · 8.0 / 10

Judge 04 · BERT · Plagiarism

Cosine similarity 0.18 against the corpus. No verbatim copy detected. No veto raised.

Pass · No veto

Consensus engine

Weighted average across the judges (weights 0.30 / 0.25 / 0.30 / 0.15 for Judges 01 to 04, with Judge 04 holding a veto). The consensus mark is recorded with the full deliberation log.

Final · 7.83 / 10
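A minimal sketch of one way the weighting could work, not the engine's actual code. It assumes that a judge who raises no numeric vote (here the plagiarism judge) drops out and the remaining weights are renormalised, and that any veto cancels the mark. The `Verdict` class and `consensus` function are hypothetical names. Under these assumptions the panel above comes to roughly 7.82, a hair below the 7.83 shown, so the real engine may normalise differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    judge: str
    weight: float
    mark: Optional[float] = None  # None: this judge does not score (assumption)
    veto: bool = False


def consensus(verdicts: list) -> Optional[float]:
    """Weighted mean over scoring judges; None if any judge vetoes."""
    if any(v.veto for v in verdicts):
        return None  # a veto cancels the panel verdict
    scored = [v for v in verdicts if v.mark is not None]
    total_weight = sum(v.weight for v in scored)
    if total_weight == 0:
        raise ValueError("no scoring judges on the panel")
    # Renormalise so the scoring judges' weights sum to 1 (assumption).
    return sum(v.weight * v.mark for v in scored) / total_weight


# Votes and weights taken from case Q-0431 above.
panel = [
    Verdict("Judge 01", 0.30, 7.5),
    Verdict("Judge 02", 0.25, 8.0),
    Verdict("Judge 03", 0.30, 8.0),
    Verdict("Judge 04", 0.15, None, veto=False),
]
final = consensus(panel)
print(round(final, 2))
```

Returning `None` on a veto, rather than a zero mark, keeps "no verdict" distinct from "scored zero" in the audit log.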

Final mark · explainable · auditable

Awarded 7.83 / 10 with deliberation log attached.


4 · Specialised LLM judges

1 · Veto vote on plagiarism

2 · Inference modes (cloud + local)

∞ · Audit log entries per question

0 · Single-judge bias

Act I · The Problem

A tired examiner
at 2 a.m.

Examiners grade hundreds of papers a week. Marks reflect mood, fatigue and order, not just the answer. Students can't see the reasoning. Teachers can't defend the score.

i.

Single-rater bias.

One examiner equals one judgment call. Studies repeatedly show examiners disagreeing by as much as two full points on the same essay.

ii.

No audit trail.

The student receives a number. There is no log of the reasoning behind it. Disputes go nowhere.

iii.

Plagiarism caught too late.

Manual checks happen after the mark is awarded. By then the verdict is on the report card.

iv.

One model is one bias.

Use only Gemini and you inherit Gemini's blind spots. Use only Claude and you inherit Claude's. The fix is plurality.

Act II · The Bench

The four judges.
One specialty each.

Each model is asked the question it is best at. Their verdicts are weighted. The plagiarism judge can veto outright.

Judge 01

Gemini Pro

The theorist.

Reads the answer for conceptual accuracy. Strongest on definition, derivation, citation. Cloud inference via Google API.

Judge 02

Llama 3 (Ollama)

The code reader.

Local inference. Reads pseudocode and source for syntactic and structural correctness. No data leaves the building.

Judge 03

Mistral or Claude

The dissenter.

Reads the other judges' verdicts and argues against them when warranted. Restores marks the panel missed.

Judge 04

BERT (similarity)

The veto.

Computes cosine similarity against the corpus. Plagiarism above threshold cancels the panel verdict.
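The veto check can be sketched as follows, assuming the answer and each corpus document have already been embedded as plain float vectors (for example by a BERT-style encoder, which is not shown). The threshold value and the function names are hypothetical. The page does not state the real cut-off.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty embedding as "no similarity"
    return dot / (norm_a * norm_b)


VETO_THRESHOLD = 0.85  # hypothetical value, chosen only for illustration


def plagiarism_veto(answer_vec, corpus_vecs, threshold=VETO_THRESHOLD):
    """Return (veto, best_score): veto is True when the closest
    corpus document is more similar than the threshold."""
    best = max(
        (cosine_similarity(answer_vec, doc) for doc in corpus_vecs),
        default=0.0,
    )
    return best > threshold, best


# Toy 3-dimensional "embeddings" standing in for real encoder output.
veto, score = plagiarism_veto([1.0, 0.0, 1.0], [[0.0, 1.0, 0.0], [1.0, 0.2, 0.0]])
print(veto, round(score, 2))
```

Returning the best score alongside the boolean lets the audit log record how close the answer came to the threshold, not just whether it crossed it.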

Act III · The Digital Twin

Grade in your teacher's voice.

A Vector RAG persona engine learns each teacher's style. Strict on units? Generous on partial credit? The swarm grades the way that teacher would, every time.

The Persona RAG engine.

Past corrected answers feed ChromaDB. Every new submission is graded by the swarm with that teacher's vector context injected. Two teachers get two different fair marks for the same paper, and both can defend their own.

"Award partial credit for correctly stated principle even if the worked example is missing." Dr Mehta's grading style, learned over 312 prior assessments.
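The retrieval step can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions: it swaps ChromaDB and real embeddings for a toy bag-of-words similarity so that it runs without dependencies, and `PersonaStore`, `build_prompt` and the second teacher's note are hypothetical. The shape of the idea is the same: fetch one teacher's most relevant past corrections and inject them into the judges' prompt.

```python
import math
import re
from collections import Counter


def _bag(text):
    """Lowercased bag of words; a toy substitute for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _similarity(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class PersonaStore:
    """In-memory stand-in for a per-teacher vector collection."""

    def __init__(self):
        self._notes = []  # list of (teacher, note) pairs

    def add(self, teacher, note):
        self._notes.append((teacher, note))

    def retrieve(self, teacher, query, k=2):
        """Top-k notes for this teacher only, most similar first."""
        q = _bag(query)
        own = [n for t, n in self._notes if t == teacher]
        return sorted(own, key=lambda n: _similarity(q, _bag(n)), reverse=True)[:k]


def build_prompt(store, teacher, question, answer):
    style = "\n".join(f"- {n}" for n in store.retrieve(teacher, question + " " + answer))
    return (
        f"Grade as {teacher} would.\nStyle notes:\n{style}\n"
        f"Question: {question}\nAnswer: {answer}"
    )


store = PersonaStore()
store.add("Dr Mehta", "Award partial credit for correctly stated principle "
                      "even if the worked example is missing.")
store.add("Teacher B", "Deduct a mark when units are missing.")  # made-up note
prompt = build_prompt(store, "Dr Mehta", "Explain binary semaphores.",
                      "Principle stated, no worked example.")
print(prompt)
```

Filtering by teacher before ranking is what lets two teachers receive different marks for the same paper: each prompt carries only that teacher's style notes.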

Act IV · Hybrid Infra

Cloud APIs and local Ollama.
Switch on a per-question basis.

A router decides which judge runs in the cloud and which runs on-prem. Sensitive student data never has to leave the institution.
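A routing rule of this kind might look like the sketch below. The `Question` fields, the `route` function and the specific policy (sensitive data always stays local, local is the fallback when the cloud is unavailable) are assumptions for illustration, not the project's actual router.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    CLOUD = "cloud"   # e.g. a hosted API
    LOCAL = "local"   # e.g. an on-prem Ollama instance


@dataclass(frozen=True)
class Question:
    qid: str
    contains_student_data: bool  # hypothetical sensitivity flag


def route(question, judge_prefers_cloud, cloud_available=True):
    """Pick a backend for one judge on one question."""
    if question.contains_student_data:
        return Backend.LOCAL  # sensitive data never leaves the institution
    if judge_prefers_cloud and cloud_available:
        return Backend.CLOUD
    return Backend.LOCAL  # on-prem fallback


q_public = Question("Q-0431", contains_student_data=False)
q_private = Question("Q-0432", contains_student_data=True)
print(route(q_public, judge_prefers_cloud=True))
print(route(q_private, judge_prefers_cloud=True))
```

Checking the sensitivity flag first means that no combination of the other inputs can send flagged data to the cloud.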

Python 3.10+
FastAPI (async)
LangChain · CrewAI
ChromaDB
Pinecone (cloud)
Ollama · Llama 3 local
Google Gemini Pro
Anthropic Claude
OpenAI GPT-4
BERT similarity
Weighted consensus matrix
Audit trail per question

If a mark must be defensible, one model is not enough.

I architect multi-agent grading systems with auditable consensus, persona learning and on-prem fallback. Built for institutions where every mark might land in court.
