
Upgrading India’s Largest EMR with a Swarm of AI Agents

A month of work, shipped in five days — by trading one giant prompt for a small, disciplined team of AI agents, kept honest by guardrails that fail the build.


Every signal said it was working. The build was green. The server answered with a clean 200 OK. And yet the module rendered nothing — a blank grid where a doctor's day of appointments should have been. No crash, no error message, just an empty screen and the kind of silence that usually costs an engineer a lost afternoon.


This time, nobody went hunting. An AI agent had already opened the old and new versions side by side, spotted the mismatch, and flagged it — before it ever reached a human reviewer. That one small catch is a window into how we've started building at HealthPlix: not by prompting a chatbot, but by running a small team of AI agents under tight human guardrails.


HealthPlix builds the EMR that handles more than 1.5 lakh (150,000) consultations every day. Some of it still runs on a legacy PHP monolith, where the screens talk straight to the database, so we're rebuilding it, module by module, on a modern React front end with a secure server layer in between.

The real surprise wasn't that AI could write code. It was how much further it went once we stopped treating it as a chatbot and started treating it as a team. A small group from our engineering team took a full module from legacy PHP to production-ready in about five days — work we'd normally have budgeted a month for — and, more importantly, left behind a system that makes the next module faster than the last.


By the numbers

  • ~5 days, not ~1 month — the same legacy PHP module → production-ready React, roughly 4× faster than the old way

  • A small crew from engineering drove the whole thing end-to-end

  • 8 specialised AI agents, none allowed to sign off its own work

  • 3 reusable skills → 1 plugin, now shared across every repo

  • ~1,300 legacy lint issues triaged so old debt never blocked new work

  • 1 sneaky bug — "200 OK, empty grid" — caught by comparing against the live app

The plot twist: there were no APIs to call


Here's the fact that surprised everyone we've told. When you migrate an old app, you'd assume the AI just calls the existing API. But at the start there was no API: the old app spoke directly to the database, so the "interface" was really a pile of hand-written SQL queries. When we asked the AI to "wire up the endpoint," there often wasn't one yet.

So the single most valuable thing a human contributed wasn't code. It was a map: which new endpoint replaces which old query. That's domain knowledge no model can infer from the code alone. Once we handed it over ("this board's slots come from the new slots service, not a direct query"), the agents wired it correctly every time. It also told us where the real leverage is: build that map once, and the AI can navigate the migration itself.
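To make the idea of a query-to-endpoint map concrete, here is a minimal sketch of what such a map could look like as data an agent can look up. Every name in it (the query labels, the routes, `lookupEndpoint`) is an assumption made for illustration, not our real schema or API.

```typescript
// Hypothetical sketch: query labels, routes, and function names are invented
// for illustration and are not the real HealthPlix identifiers.
type HttpMethod = "GET" | "POST" | "PUT" | "DELETE";

interface EndpointMapping {
  legacyQuery: string; // label for a hand-written SQL query in the old module
  method: HttpMethod;
  endpoint: string; // the new server-layer route that replaces the query
  note?: string; // domain knowledge the code alone does not reveal
}

const queryToEndpoint: EndpointMapping[] = [
  {
    legacyQuery: "board.selectSlotsForDoctor",
    method: "GET",
    endpoint: "/slots-service/doctors/:doctorId/slots",
    note: "Board slots come from the new slots service, not a direct query.",
  },
  {
    legacyQuery: "board.insertAppointment",
    method: "POST",
    endpoint: "/appointments",
  },
];

type Lookup =
  | { status: "mapped"; mapping: EndpointMapping }
  | { status: "unmapped"; legacyQuery: string };

// Returns the replacement endpoint, or an explicit "unmapped" result so the
// caller asks a human instead of guessing a route.
function lookupEndpoint(legacyQuery: string): Lookup {
  for (const mapping of queryToEndpoint) {
    if (mapping.legacyQuery === legacyQuery) {
      return { status: "mapped", mapping };
    }
  }
  return { status: "unmapped", legacyQuery };
}
```

The explicit "unmapped" case is the point of the design: when the map has no entry, the agent has something to report rather than an endpoint to invent.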


We didn't prompt a chatbot — we hired a team


Instead of one giant prompt doing everything, we built a small crew of eight specialised agents. Each one was simply a role we'd noticed ourselves playing by hand, over and over: go explore the old module. Now check it against the rules. Now write the test. Now verify it behaves like the legacy one. Each became an agent with one job.

The glue was a supervisor that worked in a loop: after every build it verifies the result, and if something's wrong it routes the fix back to the right agent and re-verifies — again and again until everything is green. The rule that made it trustworthy: no agent signs off its own work. Every deliverable was checked by another. What made it fast was almost boring — send all the read-only investigating out in parallel, but let only one agent write to a file at a time.
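The supervisor loop can be sketched in a few lines. This is a simplified, synchronous illustration under stated assumptions: the `Agent` and `Verifier` shapes, the `maxFixRounds` cap, and the function name `supervise` are all hypothetical, and the parallel read-only fan-out is left out to keep the verify, route, re-verify cycle visible.

```typescript
// Hypothetical sketch of a verify -> route fix -> re-verify loop.
// All type and function names here are assumptions for illustration.
interface Finding {
  owner: string; // name of the agent responsible for fixing the problem
  problem: string;
}

interface Agent {
  name: string;
  fix(problem: string): void;
}

interface Verifier {
  name: string;
  verify(): Finding[]; // an empty array means the build is green
}

interface Outcome {
  green: boolean;
  fixRounds: number;
  remaining: Finding[];
}

function supervise(
  agents: Record<string, Agent>,
  verifier: Verifier,
  maxFixRounds = 5
): Outcome {
  let findings = verifier.verify();
  let fixRounds = 0;
  while (findings.length > 0 && fixRounds < maxFixRounds) {
    for (const finding of findings) {
      // No agent signs off its own work: the checker never fixes or approves
      // something it owns.
      if (finding.owner === verifier.name) {
        throw new Error("an agent may not verify its own work");
      }
      const agent = agents[finding.owner];
      if (!agent) {
        throw new Error("no agent registered for " + finding.owner);
      }
      agent.fix(finding.problem); // one writer at a time: fixes run in sequence
    }
    fixRounds++;
    findings = verifier.verify(); // every fix is re-verified before going on
  }
  return { green: findings.length === 0, fixRounds, remaining: findings };
}
```

The cap on rounds is our addition for the sketch: a loop that runs "until everything is green" needs some stopping point where it hands the remaining findings to a human.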


The module that was already built (except it wasn't)


Midway through, an agent reported good news: this module already exists, no need to build it. It would have been easy, and wrong, to believe it. We looked at the actual files. The module did exist… as a hollow scaffold returning wrong placeholder data. So we neither trusted it as finished nor threw it away: we extended it.

The lesson stuck: when an AI makes a surprising claim — "this already works," "the tests pass" — check it against reality before you act on it. Confidence is not evidence.


Notes passed under the door


Our agents couldn't talk to each other directly. So each one left a short, structured note in a shared file — what I found, why, where (down to the line), and what's next. Those notes turned out to do two jobs. The next agent read the note instead of re-discovering everything from scratch. And when a long multi-step run got interrupted, we could resume from the last note instead of starting over — the run picked up at the last good checkpoint. It was, in effect, the swarm's shared memory: cheap to write, and it made a fragile long process restartable.
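As an illustration of how notes can double as checkpoints, here is a small sketch that treats the shared file as one JSON object per line. The field names, the `ok` flag, and the step numbering are assumptions for the example; the file is modelled as a string so the logic stays self-contained.

```typescript
// Hypothetical note format: field names are illustrative assumptions.
interface AgentNote {
  step: number; // position of this agent's task in the run
  agent: string;
  found: string; // what I found
  why: string;
  where: string; // file and line, for example "src/Board.tsx:42"
  next: string; // what the next agent should do
  ok: boolean; // whether the step finished successfully
}

// Appends one note as a single JSON line to the shared log text.
function appendNote(log: string, note: AgentNote): string {
  return log + JSON.stringify(note) + "\n";
}

// Reads notes back in order. A line that fails to parse is treated as a
// half-written note from an interrupted run, and reading stops there.
function readNotes(log: string): AgentNote[] {
  const notes: AgentNote[] = [];
  for (const line of log.split("\n")) {
    if (line.trim() === "") continue;
    try {
      notes.push(JSON.parse(line) as AgentNote);
    } catch {
      break;
    }
  }
  return notes;
}

// The step to resume from: one past the last note in the unbroken run of
// successful notes at the start of the log.
function resumeStep(log: string): number {
  let lastGood = 0;
  for (const note of readNotes(log)) {
    if (!note.ok) break;
    lastGood = note.step;
  }
  return lastGood + 1;
}
```

Writing one complete line per note keeps a crash from corrupting earlier notes: at worst the final line is truncated, and the reader skips it.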


Assume it will try to cheat


Here's the uncomfortable truth about coding with AI: when something won't pass, it will cheerfully take the shortcut — delete the failing test, silence the warning, wave the type checker through. So we stopped asking it to follow the rules and made the rules fail the build.

One concrete example: the old code read a doctor's context from global browser variables and stashed login tokens in the page — a genuine security smell. Rather than remind the AI not to, we made an automated check fail the build the instant any module reached for those globals. A rule that's remembered can be forgotten; a rule that fails the build can't. When the legacy code threw roughly 1,300 style complaints, we split the checks into two tiers — blocking versus advisory — so old noise never stalled real work while the rules that mattered stayed hard. And because the AI sometimes still tried to game the gate, the real gate lives in the pipeline where it can't be skipped, with a human reading the final diff. The gates keep the AI honest; the pipeline keeps the gates honest.
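A rough sketch of the two ideas in this section, a check that fails on forbidden globals and a gate that only blocks on the hard tier, might look like the following. The global names and rule labels are hypothetical, and plain text matching is a deliberate simplification: a real setup would more likely express this as a lint rule that understands the syntax tree.

```typescript
// Hypothetical sketch: the forbidden names and rule labels are invented.
type Tier = "blocking" | "advisory";

interface Violation {
  rule: string;
  tier: Tier;
  file: string;
  line: number;
}

// Assumed examples of legacy globals a module must not reach for.
const FORBIDDEN_GLOBALS = ["window.doctorContext", "window.authToken"];

// Crude text scan: flags every line that mentions a forbidden global.
function scanForGlobals(file: string, source: string): Violation[] {
  const found: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const name of FORBIDDEN_GLOBALS) {
      if (text.indexOf(name) !== -1) {
        found.push({
          rule: "no-legacy-global:" + name,
          tier: "blocking",
          file,
          line: index + 1,
        });
      }
    }
  });
  return found;
}

// Advisory findings are reported but never fail the build; a single
// blocking finding does.
function gateExitCode(violations: Violation[]): number {
  const blocking = violations.filter((v) => v.tier === "blocking");
  return blocking.length > 0 ? 1 : 0;
}
```

Run in the pipeline rather than only on a developer machine, the non-zero exit code is what turns a remembered rule into one that cannot be skipped.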


We gave it eyes — and that changed everything


The biggest single multiplier was letting the AI see the running product through a browser: open the app, read the network calls, inspect what actually rendered. Because this was a migration, "correct" didn't mean "compiles" — it meant "behaves exactly like the old module." So the AI opened both versions and compared them directly.

That's the "200 OK, empty grid" from the opening. The new module was technically healthy — the server said everything was fine — but the shape of the data didn't match what the old module expected, so nothing was rendered. A human might have shipped it and found out from a support ticket. The AI caught it by looking at the real thing. Proving parity beat assuming it.
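The kind of shape mismatch behind a "200 OK, empty grid" can be illustrated with a small structural diff of two JSON payloads. This is a sketch of the general technique, not the check our agents actually ran, and the function names are assumptions for the example.

```typescript
// Hypothetical sketch of a structural (shape-only) comparison.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Flattens a value into path -> type, ignoring the actual data.
// Array element shape is taken from the first element only.
function shapeOf(
  value: Json,
  path = "$",
  out: Record<string, string> = {}
): Record<string, string> {
  if (Array.isArray(value)) {
    out[path] = "array";
    const first = value[0];
    if (first !== undefined) shapeOf(first, path + "[]", out);
  } else if (value !== null && typeof value === "object") {
    out[path] = "object";
    for (const key of Object.keys(value)) {
      const child = value[key];
      if (child !== undefined) shapeOf(child, path + "." + key, out);
    }
  } else {
    out[path] = value === null ? "null" : typeof value;
  }
  return out;
}

// Lists every path whose presence or type differs between old and new.
function shapeDiff(legacy: Json, next: Json): string[] {
  const a = shapeOf(legacy);
  const b = shapeOf(next);
  const diffs: string[] = [];
  for (const p of Object.keys(a)) {
    if (!(p in b)) diffs.push("missing in new: " + p + " (" + a[p] + ")");
    else if (a[p] !== b[p]) diffs.push("type changed at " + p + ": " + a[p] + " -> " + b[p]);
  }
  for (const p of Object.keys(b)) {
    if (!(p in a)) diffs.push("extra in new: " + p + " (" + b[p] + ")");
  }
  return diffs;
}
```

A status code says nothing about structure, which is why a diff like this can flag a response that is healthy by every server-side signal. An empty array hides its element shape here, so a comparison against real, populated data is more informative than one against fixtures.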


The hard parts — and what actually fixed them


None of this was frictionless. The honest version is a list of things that fought us — and the fix that made each one stop being a problem.

  • Challenge: The AI wandered off and over-engineered — solving problems we didn't have.
    Fix: Give it a role and a lean rulebook, and forbid it from writing code until it restates the plan and we confirm. Most of the wandering disappeared before it started.

  • Challenge: No API existed to call — the legacy app spoke straight to the database.
    Fix: A human supplied the query-to-endpoint map the AI couldn't infer. Now we're turning that map into something the agents can look up themselves.

  • Challenge: It would take shortcuts to "make it pass" — deleting a test, silencing a warning.
    Fix: Make the rules fail the build, put the real gate in the pipeline where it can't be skipped, and have a human read the final diff.

  • Challenge: ~1,300 legacy lint errors threatened to block every change.
    Fix: Split checks into blocking vs advisory — the rules that matter stay hard; old noise is reported, not fatal.

  • Challenge: Agents couldn't talk to each other, and a long run could die halfway.
    Fix: Each agent leaves a short note in a shared file — shared memory that doubles as a resume point, so a failed run continues from the last checkpoint.

  • Challenge: An agent confidently declared a module "already built."
    Fix: We checked the actual files and found only a placeholder scaffold. Verify surprising claims against reality before acting on them.

  • Challenge: It kept putting reusable widgets in the wrong place.
    Fix: Taught it one rule: anything reused in more than one place goes in a shared library, and anything one-off stays in its module. We're encoding that as a build check so it never drifts again.

  • Challenge: "Green build" didn't mean "correct" for a migration.
    Fix: Had the AI open the old and new versions in a browser and compare them directly — proving parity instead of assuming it.
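The widget-placement rule from the list above is simple enough to sketch as a check. The `Component` shape and the function names are assumptions for illustration; how a real build would collect the list of importing modules is left out.

```typescript
// Hypothetical sketch: types and names are invented for illustration.
type Placement = "shared-library" | "module";

interface Component {
  name: string;
  location: Placement; // where the component currently lives
  importers: string[]; // modules that use it (may contain repeats)
}

// Used by more than one distinct module -> shared library; otherwise it
// belongs in the single module that uses it.
function placementFor(importers: string[]): Placement {
  const distinct = importers.filter((m, i) => importers.indexOf(m) === i);
  return distinct.length > 1 ? "shared-library" : "module";
}

// Names of components whose current location disagrees with the rule.
function misplaced(components: Component[]): string[] {
  return components
    .filter((c) => c.location !== placementFor(c.importers))
    .map((c) => c.name);
}
```

A build check could fail whenever `misplaced` returns a non-empty list, which is one way to keep the rule from depending on anyone's memory.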

What we kept firmly human

The AI did the heavy lifting, but it isn't in charge. The query-to-endpoint map was ours to give. Secrets and credentials were always set by a person. And anything irreversible — publishing, sending, merging to production — the AI proposes and a human presses the button. That last part isn't red tape; it's a safety boundary. The aim is to automate the toil, never the judgment.

— From the engineering team at HealthPlix
