No. 24 · AI · Agents · HR
The fine line between
shortlisted and
rejected.
रेशा · Resha · the fibre
A recruiter receives 600 resumes for one role. They open 30. Twenty die in the first ninety seconds. Brilliant candidates fall on "page two". Resha gives every candidate the same fair, three-tier review: on-device, local LLM, cloud fallback. Same prompt, same rubric, every time.
Act I · The Funnel
Six hundred resumes.
One role.
Ninety seconds each.
This is a math problem before it is a hiring problem. A recruiter physically cannot give a fair review to 600 resumes. So they don't. They scan, they skim, they reject on font choice, and the company loses the engineer they needed.
- 600 resumes received for one role
- A filename or first-line pass thins the pile
- 30 actually opened
- 10 survive the first 90 seconds
The other 590 are not bad candidates. They are unread.
Act II · The Promise
Three inference tiers.
One verdict you can defend.
Cloud goes down. Bandwidth dies. The intern's laptop has no GPU. Resha works anyway. The orchestrator picks the best available tier and routes the analysis; when two or more tiers agree, consensus weighting raises the confidence score. A sketch of that routing follows the three tiers below.
On-device
SBERT semantic match · TF-IDF
Pure Python, CPU-only, runs anywhere. spaCy entity extraction over skills, experience, and education. all-MiniLM-L6-v2 embeds the resume and the job description; cosine similarity scores the match.
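A minimal sketch of that match in Python, assuming the sentence-transformers library and the model named above; `semantic_match` is an illustrative helper, not Resha's API:

```python
# Sketch of the on-device semantic match: all-MiniLM-L6-v2 + cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small enough for CPU-only

def semantic_match(resume_text: str, jd_text: str) -> float:
    # Embed both texts in one batch, then score with cosine similarity.
    emb = model.encode([resume_text, jd_text], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```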
Local LLM
Ollama · Phi-3, Qwen2.5, Gemma 3
Runs on the candidate's data, on your machine. No internet required. Five-minute timeout, model warm-up on startup, LRU cache for repeats. `ollama pull gemma3:1b` and you are off.
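The local tier in sketch form, assuming Ollama's standard HTTP API on its default port; `local_verdict` and the cache size are illustrative, not Resha's code:

```python
# Sketch: local LLM tier via Ollama's HTTP API, with timeout and repeat-cache.
from functools import lru_cache
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama default port

@lru_cache(maxsize=256)  # identical prompts hit the cache, not the model
def local_verdict(prompt: str, model: str = "gemma3:1b") -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # the five-minute ceiling from the text
    )
    resp.raise_for_status()
    return resp.json()["response"]
```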
Cloud LLM
Google Gemini fallback
When you have signal and you want the highest ceiling. Sub-3s round-trip, real chain-of-thought streaming back to the dev panel, full reasoning trace logged to SQLite for audit.
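How the three tiers could compose, as a hedged sketch: `Verdict`, `route`, and the consensus bump of +0.1 per agreeing tier are illustrative stand-ins, not Resha's actual orchestrator:

```python
# Sketch of three-tier routing with consensus weighting; names are hypothetical.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    tier: str
    shortlist: bool
    confidence: float

def route(resume: str, jd: str,
          tiers: list[tuple[str, Callable[[str, str], Optional[Verdict]]]]) -> Verdict:
    """Run every available tier with the same prompt and rubric."""
    verdicts = []
    for name, analyze in tiers:
        try:
            v = analyze(resume, jd)
            if v:
                verdicts.append(v)
        except Exception:
            continue  # tier unavailable: cloud down, no GPU, no bandwidth
    if not verdicts:
        raise RuntimeError("no inference tier available")
    top = max(verdicts, key=lambda v: v.confidence)
    agree = sum(1 for v in verdicts if v.shortlist == top.shortlist)
    if agree >= 2:  # two or more tiers agree: raise the confidence
        top.confidence = min(1.0, top.confidence + 0.1 * (agree - 1))
    return top
```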
Act III · The Reasoning
Real chain-of-thought.
Not simulated steps.
Resha streams every reasoning step from the LLM as it happens, over Server-Sent Events with buffering disabled, so a recruiter or auditor can watch the model think, in order, in real time.
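A sketch of that plumbing with FastAPI's `StreamingResponse`; the hard-coded steps stand in for the real LLM stream, and `X-Accel-Buffering: no` is the usual way to keep Nginx from buffering SSE:

```python
# Sketch: stream reasoning steps as Server-Sent Events with buffering disabled.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def reasoning_steps(resume: str, jd: str):
    # In Resha this would iterate the LLM's live token/step stream.
    for step in ("parsing resume", "matching skills", "scoring", "verdict"):
        yield f"data: {step}\n\n"  # SSE frame: data line plus blank line

@app.post("/api/dev/analyze-stream")
async def analyze_stream(resume: str, jd: str):
    return StreamingResponse(
        reasoning_steps(resume, jd),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
```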
Act IV · The Microservice
Drop into any ATS.
FastAPI in, JSON out.
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /api/analyze | SBERT semantic analysis · resume vs JD |
| POST | /api/shortlist | Cloud LLM shortlist verdict |
| POST | /api/analyze-file | Upload PDF/DOCX/TXT for analysis |
| GET | /api/history | SQLite audit trail of decisions |
| POST | /api/dev/analyze-stream | Streaming chain-of-thought · SSE |
| POST | /api/dev/hybrid-analyze | Multi-model consensus verdict |
| POST | /api/dev/warmup | Preload Ollama model into RAM |
| DELETE | /api/dev/cache | Clear the LRU response cache |
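Calling it from an ATS is one POST. A sketch, assuming the live base URL from Act V; the JSON field names and auth header are illustrative, not Resha's schema:

```python
# Sketch client call; field names and header name are assumptions.
import requests

r = requests.post(
    "https://reas.dmj.one/task2/api/analyze",
    json={"resume": "…resume text…", "job_description": "…JD text…"},
    headers={"X-API-Key": "<your-key>"},  # API-key auth per Act V
    timeout=30,
)
print(r.json())
```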
Act V · Proof
Production-deployed. Audit-ready.
Live · reas.dmj.one/task2/
One-click systemd deployment. Nginx reverse proxy on port 22000, isolated from the /task1/ service. Auto-generated secret keys, hardened security headers.
SQLite audit trail
Every decision logged with the inputs, the tier used, the verdict, the reasoning. Defensible if a candidate ever asks why.
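One plausible shape for that table, as a sketch; the schema here is inferred from the text, not the shipped one:

```python
# Sketch: one row per decision; columns mirror what the text says is logged.
import sqlite3

conn = sqlite3.connect("audit.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS decisions (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at      TEXT DEFAULT CURRENT_TIMESTAMP,
    resume_text     TEXT NOT NULL,   -- the inputs
    job_description TEXT NOT NULL,
    tier            TEXT NOT NULL,   -- on-device / local / cloud
    verdict         TEXT NOT NULL,   -- shortlisted / rejected
    reasoning       TEXT             -- full reasoning trace
)
""")
conn.commit()
```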
Strict CSP, HSTS, X-Frame-Options
API-key authentication, no plaintext secrets, security headers on every response. Drop into a regulated environment without rewriting policy.
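Those headers fit naturally in a FastAPI middleware. A sketch of the pattern, with header values as illustrative defaults:

```python
# Sketch: hardened security headers on every response via FastAPI middleware.
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def security_headers(request: Request, call_next):
    response = await call_next(request)
    response.headers["Content-Security-Policy"] = "default-src 'self'"
    response.headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    response.headers["X-Frame-Options"] = "DENY"
    return response
```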
Four local model presets
Gemma 3 1B (~1GB), Qwen2.5 3B (~2GB), Gemma 2B (~1.5GB), TinyLlama (~700MB). Pick by hardware; the orchestrator handles the rest.
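Picking by hardware can be as simple as matching free memory to the footprints above. A sketch, assuming psutil is available; the thresholds are illustrative:

```python
# Sketch: choose the largest local preset that fits in free memory.
import psutil  # assumption: psutil installed for memory introspection

PRESETS = [  # (min free GB, Ollama tag), largest first
    (2.5, "qwen2.5:3b"),
    (2.0, "gemma:2b"),
    (1.5, "gemma3:1b"),
    (0.0, "tinyllama"),
]

def pick_model() -> str:
    free_gb = psutil.virtual_memory().available / 2**30
    return next(tag for min_gb, tag in PRESETS if free_gb >= min_gb)
```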
The Stack
Hybrid by design. Offline by default.
- Python 3.10+
- FastAPI
- SBERT MiniLM-L6
- spaCy en_core_web_sm
- PyTorch CPU
- Ollama
- Gemma 3 / Qwen2.5
- Gemini fallback
- SQLite audit
- SSE streaming
- systemd
- Nginx reverse-proxy
If 590 unread resumes can become 590 fair reviews, your hiring can change too.
I build inference orchestrators that route across edge, local and cloud, and stream their reasoning so humans stay in the loop. If your AI needs to defend itself, talk to me.