World models in healthcare: what LLMs can't see

Guru Kora, CTO at Quantino Health, joined the host of Align Health Insights and a fellow founder from Actually Health on Episode 14 to talk about world models — the architectural bet that the next layer of healthcare AI won't be a better LLM, but a different shape entirely. The full conversation is above. A few threads from it below.

The framing matters. The last 24 months of healthcare AI have mostly been LLMs (large language models) doing token prediction: predicting what to say next, what to chart next, what to code next. They are good at that. We use them every day inside Quantino's core for documentation, triage, and front-desk voice. The LLM was never going to be the whole stack, though. There is a ceiling, and it is getting closer.

What LLMs can't see

An LLM predicts the next token. That is the entire job: given the words so far, return the most likely next word. It can sound like reasoning because the text it was trained on contains reasoning. It is not, however, modelling the world the reasoning is about.
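To make "given the words so far, return the most likely next word" concrete, here is a deliberately tiny sketch. It is a bigram counter, not an LLM: a real model conditions on the whole context with a neural network, but the job description is the same. The corpus, `train_bigram`, and `predict_next` are all made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows which across a list of sentences."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical three-sentence "training set", not real clinical text.
corpus = [
    "patient denies chest pain",
    "patient denies shortness of breath",
    "patient reports chest pain",
]
model = train_bigram(corpus)
print(predict_next(model, "patient"))  # denies
print(predict_next(model, "chest"))    # pain
```

The predictor returns "denies" after "patient" because that is what the text usually says, not because it knows anything about the patient. Scale changes the quality of the guess, not the nature of the task.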

This works well where the tolerance for surface error is high. Drafting a discharge summary. Reformatting a prior authorization letter. Rough-drafting a SOAP note that a clinician will edit anyway. Where the cost of being wrong is "an editor catches it before it ships," LLMs are good infrastructure.

Where they break down is where the cost of being wrong is a patient outcome. Clinical reasoning — the real kind, not the LinkedIn version — runs as a continuous mental model of what is likely to happen next: if I push fluids here, if I delay the imaging, if I send the patient home tonight. That is not a token-prediction problem. That is a world-prediction problem, and it sits one architectural layer below where LLMs operate.

What a world model adds

A world model predicts the next state of an environment given an action, not the next word in a transcript. The research literature has been converging on a capability rubric:

L1: Temporal prediction. What happens next, given history.
L2: Action-conditioned prediction. What happens next, given an action we take.
L3: Counterfactual reasoning. What would have happened, had we done X instead of Y.
L4: Planning and control. Choose actions to drive the system toward a goal.
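One way to see how the four levels differ is to write each as a function over the same toy environment. The sketch below is an assumption-heavy illustration, not a clinical model: the state is a single hypothetical "risk score", the two actions ("treat", "wait") and the hand-written dynamics in `step` stand in for what a real world model would have to learn from data, and the planner is a brute-force search.

```python
from itertools import product

def predict_from_history(history):
    """L1: extrapolate the next value from the last observed change."""
    return history[-1] + (history[-1] - history[-2])

def step(state, action):
    """L2: next state given an action (toy, hand-written dynamics)."""
    drift = 0.5                                   # score creeps up on its own
    effect = -1.5 if action == "treat" else 0.0   # hypothetical treatment effect
    return max(0.0, state + drift + effect)

def rollout(state, actions):
    """Apply a sequence of actions and return every state visited."""
    trajectory = [state]
    for action in actions:
        trajectory.append(step(trajectory[-1], action))
    return trajectory

def counterfactual(state, factual, alternative):
    """L3: change in final state had we followed `alternative` instead."""
    return rollout(state, alternative)[-1] - rollout(state, factual)[-1]

def plan(state, horizon, goal, treat_cost=0.25):
    """L4: search all action sequences for the cheapest path near `goal`."""
    def cost(actions):
        final = rollout(state, actions)[-1]
        return abs(final - goal) + treat_cost * actions.count("treat")
    return min(product(["wait", "treat"], repeat=horizon), key=cost)

print(predict_from_history([2.0, 2.5, 3.0]))                      # 3.5
print(step(3.0, "treat"))                                         # 2.0
print(counterfactual(3.0, ["wait", "wait"], ["treat", "wait"]))   # -1.5
print(plan(3.0, horizon=3, goal=1.0))
```

Note that L1 never sees an action at all, while L3 and L4 are only as good as `step`. That dependency is the whole point of the rubric: counterfactuals and plans are rollouts of a transition model, so they cannot be better than the model's grasp of what actions do.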

Most healthcare AI today sits at L1 or L2. The interesting clinical questions live at L3: readmission risk if we discharge tomorrow versus the day after, treatment trajectory under one regimen versus another, the probability that this specific patient with this specific comorbidity profile decompensates within 72 hours. Those are not questions an LLM can answer from text alone. They require a system that has learned how the world reacts to actions, not just how clinicians describe it after the fact.

The bridge from one to the other is the multimodal foundation model. It is not a world model. It cannot tell you what would happen if you changed the plan. But it can hold the joint distribution across labs, notes, imaging, vitals, and pathology in a single representation, which is the substrate a world model is built on top of.
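As a rough picture of "a single representation," the sketch below concatenates per-modality vectors into one fixed-length vector with a mask marking which modalities were present. This is a simplification under stated assumptions: real multimodal foundation models learn the fusion rather than concatenating, and `PatientSnapshot`, `fuse`, `MODALITIES`, and the embedding size are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MODALITIES = ["labs", "notes", "imaging", "vitals", "pathology"]
EMBED_DIM = 3  # hypothetical size of each per-modality embedding

@dataclass
class PatientSnapshot:
    """Per-modality embeddings for one patient at one point in time."""
    embeddings: Dict[str, List[float]] = field(default_factory=dict)

def fuse(snapshot: PatientSnapshot) -> List[float]:
    """Concatenate embeddings in a fixed order, then append a presence mask.

    Missing modalities are zero-filled, and the mask lets a downstream
    model tell "absent" apart from "present but all zeros".
    """
    fused: List[float] = []
    mask: List[float] = []
    for name in MODALITIES:
        vec = snapshot.embeddings.get(name)
        if vec is None:
            fused.extend([0.0] * EMBED_DIM)
            mask.append(0.0)
        else:
            if len(vec) != EMBED_DIM:
                raise ValueError(f"{name} embedding must have {EMBED_DIM} values")
            fused.extend(vec)
            mask.append(1.0)
    return fused + mask

snap = PatientSnapshot(embeddings={
    "labs": [0.2, 0.1, 0.9],
    "vitals": [0.5, 0.5, 0.0],
})
vector = fuse(snap)
print(len(vector))   # 20
print(vector[-5:])   # [1.0, 0.0, 0.0, 1.0, 0.0]
```

Nothing here predicts anything. It only gives every patient the same-shaped state vector, which is the kind of input a transition model would need before it could learn what actions do.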

Why healthcare specifically

Healthcare data is, in Guru's words on the podcast, "very, very noisy." Labs. Notes. Imaging. Genomics. Wearables. Claims. Devices. A single MRI does not define an oncology trajectory. A single A1c does not define metabolic state. The signal lives in the joint distribution across modalities, and a token-stream architecture cannot hold that distribution at the resolution clinical decisions actually require.
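A small synthetic example shows what "the signal lives in the joint distribution" means in practice. The records below are invented: two hypothetical binary flags (`lab_flag`, `imaging_flag`) and an outcome that depends on their combination. Real data is never this clean, but the arithmetic is the same.

```python
from collections import defaultdict

# Hypothetical toy records: (lab_flag, imaging_flag, outcome).
# The outcome is 1 only when the two flags disagree.
records = [
    (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0),
] * 25

def outcome_rate(records, key):
    """Mean outcome for each group, where key(record) names the group."""
    totals = defaultdict(lambda: [0, 0])   # group -> [outcome sum, count]
    for rec in records:
        bucket = totals[key(rec)]
        bucket[0] += rec[2]
        bucket[1] += 1
    return {group: s / n for group, (s, n) in totals.items()}

by_lab = outcome_rate(records, lambda r: r[0])
by_imaging = outcome_rate(records, lambda r: r[1])
by_both = outcome_rate(records, lambda r: (r[0], r[1]))

print(by_lab)      # {0: 0.5, 1: 0.5}
print(by_imaging)  # {0: 0.5, 1: 0.5}
print(by_both)     # {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
```

Looked at one modality at a time, each flag is a coin flip. Looked at together, they determine the outcome exactly. Any system that summarizes modalities separately before combining them throws that away.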

The same multimodal logic applies to the unglamorous wedge problems we work on every day. A sleep clinic does not understand a patient by their last visit summary. It understands them by the home apnea reading, the spousal complaint about snoring, the weight trajectory, the medication list, and the no-show history. A clinic's "patient state" is a multimodal object even when the technology stack does not yet know it.

Where this lands first

Surgical robotics gets the most press, but the honest near-term zone is narrower than the press releases suggest. Training, simulation, and skill assessment are the 1-to-2-year horizon. Assistive systems, where a world model helps a surgeon rather than replacing one, are the 3-to-5-year horizon. Fully autonomous surgery is further out than that, and anyone claiming otherwise is selling.

The commercial path starts with highly bounded, highly observed specialties. Pathology. Oncology. Interventional radiology. Anywhere the data resolution is dense enough that a model can learn what actions do, not just what doctors say about them afterward. General medical reasoning (give it any patient, get the entire trajectory) is the destination, but the training resolution to build it broadly does not exist yet.

The autonomous vehicle analogy is instructive. With 750,000 cars feeding back to a simulator, the path from Level 2 to Level 3 still took more than a decade, and Level 4 is still not deployed at scale. Healthcare has less data, more regulation, and a higher cost of being wrong. Slower is the realistic prior.

What we're not changing

Quantino's core today is an LLM-anchored runtime with healthcare-aware adapters above and below it. The architecture is positioned to absorb world models as the substrate matures, but we are not announcing one, and we are not chasing one for sport.

The reason is simple. Technology is a means. The question we ask first at every clinic and health system is what changes on the bottom line. A workflow-native AI system that adapts to how a clinic actually runs beats a state-of-the-art benchmark winner that does not. Operational AI — the back-office work, the front-desk work, the around-the-visit work — is where the next several years of healthcare AI dollars live. That is where we run, and that is where the math works.

The real signal is not in the press releases. It is in the before-and-after metrics, in production.

When world models become the right tool for a problem we have, we will use them. Until then, the work speaks for itself.

What we're watching

A few honest pointers if you want to track this space without the LinkedIn noise:

Yann LeCun's AMI (Advanced Machine Intelligence). Raised $1.2B at a $3.5B valuation on the explicit thesis that LLMs are a dead end for real-world intelligence. Worth taking seriously because LeCun has been early — and right — about architecture shifts before.
Faisal Mahmood's lab at Brigham and Women's Hospital. Multimodal foundation models trained on health records. The most credible academic path toward L3 clinical reasoning we have seen.
Healthcare world model survey papers. A recent one out of Columbia and Northwell on surgical world models is a good entry point. If the math is dense, drop the PDF into your LLM of choice and have a conversation with it. That is a legitimate way to read papers now.
The L1-to-L4 rubric itself. When a vendor pitches "AI for clinical reasoning," ask which level on the rubric their system actually sits at. Most are L1 dressed as L3.

The conversation is above. If you are working on these problems and want to compare notes, get in touch.
