Work / Bharat-First / VerifiedTutor

Capstone · Bharat-First · Educational AI

A tutor that cites the page, or refuses to answer.

Most "AI tutors" hallucinate confidently and cite nothing. VerifiedTutor is the opposite. Every answer is retrieved from the prescribed lecture PDFs, generated by a local LLM, then verified claim-by-claim by an NLI model and shown with a calibrated confidence band, page citations, and a self-explanation prompt.

3 models · all local · no cloud
304 chunks · 384-dim index
12 reproducible test queries
5 research gaps closed in one pipeline
0 outbound calls during inference

Act I · The Problem

A confident wrong answer is worse than no answer at all.

  1. Pain · 01

    Faithfulness has no meter.

    Manakul 2023 and Es 2024 measured RAG hallucination rates that the systems themselves never surface. The user reads a fluent paragraph and assumes the citation supports it. Often it does not.

  2. Pain · 02

    "Lost in the middle" eats the right answer.

Liu 2024 showed long contexts collapse around middle chunks. A retriever that returns five passages and lets the LLM swallow them whole will quietly drop the one that actually matters.

  3. Pain · 03

    Confidence is not calibrated.

    Kadavath 2022 demonstrated that LLM confidence scores correlate poorly with correctness. "I am sure" and "I am wrong" travel together more often than the model knows.

  4. Pain · 04

    An answer is not a lesson.

    Chi 1989 showed self-explanation outperforms passive reading. A tutor that hands the student the answer and walks away is a search engine, not a teacher.

  5. Pain · 05

    Privacy in education is not optional.

Khosravi 2022 made the case plainly: student data leaving the laptop is a regulatory and ethical liability. A tutor that calls a third-party endpoint is vendor lock-in dressed as pedagogy.

Act II · The Promise

Retrieve. Generate. Verify. Calibrate. Teach.

The hardest line in the pipeline is the one that says: refuse if the maximum similarity is below 0.30. The tutor is allowed to say "I do not know." That is the line a search engine cannot draw.

  1. Step 01

    Hybrid retrieval.

    Semantic 0.7 plus lexical 0.3, k=5 over 304 chunks. Refuses if the maximum similarity falls below 0.30.

    MiniLM L6-v2 + TF-IDF
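The fusion and refusal gate above can be sketched in a few lines. The 0.7 / 0.3 weights, k=5, and the 0.30 refusal threshold come from the text; the function name and score inputs are illustrative, not the project's API.

```python
# Hybrid retrieval sketch: fuse semantic and lexical scores per chunk,
# keep the top-k, and refuse outright if even the best chunk is weak.
SEMANTIC_WEIGHT = 0.7
LEXICAL_WEIGHT = 0.3
REFUSAL_THRESHOLD = 0.30
K = 5

def hybrid_rank(semantic_scores, lexical_scores):
    """Return the top-k (chunk_index, fused_score) pairs, or None to refuse."""
    fused = [
        (i, SEMANTIC_WEIGHT * s + LEXICAL_WEIGHT * l)
        for i, (s, l) in enumerate(zip(semantic_scores, lexical_scores))
    ]
    fused.sort(key=lambda pair: pair[1], reverse=True)
    if not fused or fused[0][1] < REFUSAL_THRESHOLD:
        return None  # below threshold: the tutor says "I do not know"
    return fused[:K]

# One strong chunk ranks first; an all-weak retrieval triggers refusal.
assert hybrid_rank([0.62, 0.10, 0.05], [0.40, 0.05, 0.02])[0][0] == 0
assert hybrid_rank([0.10, 0.20], [0.10, 0.10]) is None
```

The refusal return is deliberately `None` rather than a low-scoring answer: downstream stages never see context the gate did not trust.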

  2. Step 02

    Strict-grounded generation.

    Qwen 2.5 1.5B Instruct streams the answer over Server-Sent Events into the browser. Strict grounding to the retrieved context.

    Qwen2.5-1.5B-Instruct
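The browser-facing side of this step is plain Server-Sent Events framing. A minimal sketch of the wire format, with a stand-in token source in place of the real Qwen stream:

```python
# SSE framing sketch: each generated token becomes one "data:" event,
# followed by a terminating "done" event. The token list is a stand-in
# for the local Qwen 2.5 1.5B Instruct stream.

def sse_frame(token: str) -> str:
    """Wrap one token in the SSE wire format."""
    return f"data: {token}\n\n"

def stream_answer(tokens):
    """Yield SSE frames for each token, then a terminating event."""
    for tok in tokens:
        yield sse_frame(tok)
    yield "event: done\ndata: \n\n"

frames = list(stream_answer(["Grounded", " answer", "."]))
assert frames[0] == "data: Grounded\n\n"
assert frames[-1].startswith("event: done")
```

In the real app this generator would back a Flask response with the `text/event-stream` content type, so the browser's `EventSource` receives tokens as they are produced.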

  3. Step 03

    Per-claim NLI verification.

    DeBERTa-v3 NLI cross-encoder runs against each retrieved chunk individually, defeating "lost in the middle" inside the verifier.

    cross-encoder/nli-deberta-v3-base
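The per-claim loop is the key structural choice: each claim is scored against each chunk individually, and a claim's support is its best entailment over chunks. A sketch with a toy scorer standing in for the DeBERTa cross-encoder:

```python
# Per-claim verification sketch: check every claim against every
# retrieved chunk separately, so one supporting passage is enough
# and a good chunk "in the middle" is never diluted by its neighbours.
# `nli_entailment` is a stand-in for cross-encoder/nli-deberta-v3-base.

def verify_claims(claims, chunks, nli_entailment):
    """Return {claim: best entailment score over all chunks}."""
    return {
        claim: max(nli_entailment(chunk, claim) for chunk in chunks)
        for claim in claims
    }

# Toy scorer: entailed iff the chunk contains the claim verbatim.
toy_nli = lambda premise, hypothesis: 1.0 if hypothesis in premise else 0.0
scores = verify_claims(
    ["stacks are LIFO"],
    ["A queue is FIFO.", "By definition, stacks are LIFO structures."],
    toy_nli,
)
assert scores["stacks are LIFO"] == 1.0
```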

  4. Step 04

    Composite confidence band.

Per-claim entailment plus embedding similarity fuse into a HIGH / MED / LOW band. The calibration cell in GAPS_CLOSED.md backs the threshold.

    HIGH · MED · LOW
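One plausible shape for the fusion, sketched below. The 0.50 / 0.75 cut points and the 0.7 / 0.3 weights are illustrative assumptions; the project's calibrated thresholds live in GAPS_CLOSED.md.

```python
# Composite confidence sketch: fuse the weakest claim's entailment with
# retrieval similarity into one band. Taking the minimum over claims
# means a single unsupported claim caps the whole answer's confidence.

def confidence_band(claim_scores, max_similarity, w_entail=0.7, w_sim=0.3):
    if not claim_scores:
        return "LOW"
    composite = w_entail * min(claim_scores) + w_sim * max_similarity
    if composite >= 0.75:
        return "HIGH"
    if composite >= 0.50:
        return "MED"
    return "LOW"

assert confidence_band([0.9, 0.95], 0.8) == "HIGH"
assert confidence_band([0.6], 0.6) == "MED"
assert confidence_band([0.2], 0.3) == "LOW"
```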

  5. Step 05

    Active-learning scaffold.

    Every answer ships with a self-explanation prompt and a probing question. The tutor wants the student to think, not to copy.

    Self-explain · probe
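Structurally, the scaffold is a small envelope attached to every answer. The prompt wording below is illustrative; the real prompts are generated alongside the answer.

```python
# Scaffold sketch: every answer ships with a self-explanation prompt
# (Chi 1989) and a probing question, so the response is a lesson,
# not a lookup. Wording here is a placeholder.

def scaffold(answer: str, topic: str) -> dict:
    return {
        "answer": answer,
        "self_explain": f"In your own words, why does this hold for {topic}?",
        "probe": f"What would change if the key assumption behind {topic} were dropped?",
    }

card = scaffold("A stack is a LIFO structure.", "stacks")
assert set(card) == {"answer", "self_explain", "probe"}
```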

Act III · The Evidence

Five published gaps. One local pipeline.

Every gap maps to a citation. Every closure maps to a measurement in the test battery. Reviewers re-run the battery and see the same numbers.

Gap · G1

Faithfulness in RAG

Per-claim NLI plus embedding similarity, surfaced to the user as a confidence band, not buried in a log file.

Manakul EMNLP 2023 · Es EACL 2024

Gap · G2

Lost in the middle

Hybrid retrieval k=5; per-claim NLI runs against each retrieved chunk, never against a fused blob.

Liu TACL 2024

Gap · G3

Calibration

Composite confidence (HIGH / MED / LOW) measured against the test battery and tuned for refusal correctness.

Kadavath 2022

Gap · G4

Pedagogical scaffolding

Self-explanation and probing question generated for every answer. Tutoring, not search.

Chi 1989

Gap · G5

Educational AI privacy

Fully local. The 3.6 GB of models live in ~/.cache/huggingface. Zero outbound calls during inference.

Khosravi 2022

Gap · G0

Refusal correctness

The tutor refuses when retrieval is weak. The test battery measures refusal alongside answer accuracy. Both are first-class metrics.

work/test_battery.py · 12 queries · 5 rounds

Act IV · Proof

It already runs. CPU image. Free tier.

Live · Cloud Run · us-central1

verifiedtutor.dmj.one

CPU image baked with all three models, sized for the Cloud Run free tier. SSE streaming, no GPU at runtime, no third-party API key required to demo.

Reproducible test battery

python work/test_battery.py

Twelve queries, five rounds. Reports per-claim entailment, confidence bands, and refusal correctness. Every gap-closure cell in GAPS_CLOSED.md is reproducible.
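Refusal correctness as a first-class metric can be sketched like this. The `expect_refusal` flag and `run_pipeline` stand-in are assumptions about the battery's shape, not its actual code:

```python
# Refusal-correctness sketch: off-syllabus queries are expected to be
# refused, so refusing counts as a hit, not a miss. `run_pipeline`
# returns None on refusal, mirroring the retrieval gate.

def refusal_correctness(queries, run_pipeline, rounds=5):
    hits = total = 0
    for _ in range(rounds):
        for q in queries:
            refused = run_pipeline(q["text"]) is None
            hits += refused == q["expect_refusal"]
            total += 1
    return hits / total

battery = [
    {"text": "What is a stack?", "expect_refusal": False},
    {"text": "Who won the 2010 World Cup?", "expect_refusal": True},
]
# Toy pipeline: refuses anything off-syllabus.
toy = lambda q: None if "World Cup" in q else "answer"
assert refusal_correctness(battery, toy) == 1.0
```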

Three-model local stack

~/.cache/huggingface/

Qwen 2.5 1.5B Instruct (generator) · MiniLM L6-v2 (retriever) · DeBERTa-v3 NLI (verifier). 3.6 GB total. Downloaded on first start, instant afterwards.

RAG index from prescribed corpus

work/rag_index.pkl

304 chunks, 384-dim embeddings, built from the Class 12 CS lecture PDFs by work/build_rag_index.py. Reproducible from work/pdf_to_md.py upward.

Defense slides + report

/pitch · /report

Capstone defense deck (←/→/↑/↓ navigation) and the DOCX report served from the same Flask app. One container, every artifact.

Open-source pipeline

gcr.io/dmjone/verifiedtutor:v2

Container image published. The recipe (PDF → corpus → index → app) is reproducible from the README on a 6 GB consumer GPU.

The Stack

Local-first. Citation-first.

  • Python 3.11
  • Flask
  • PyTorch 2.6
  • Qwen2.5-1.5B-Instruct
  • MiniLM L6-v2
  • DeBERTa-v3 NLI
  • Sentence-Transformers
  • scikit-learn TF-IDF
  • pypdf
  • SSE streaming
  • Cloud Run · us-central1
  • Docker · CPU image
  • Hugging Face Datasets

Need a partner who can ship · or a mentor who can guide a team to ship the same way?

If your application looks like an LLM that has to cite a primary source, refuse when it does not know, and protect privacy from line one, the conversation starts the same way.