July 1, 2026 · 4 min read

The Google for doctors got tied by Google.

TL;DR

On June 12 Nature Medicine reported that general frontier models beat OpenEvidence and UpToDate's Expert AI across three medical benchmarks, and that OpenEvidence, the $12B 'Google for doctors,' drew level with the free Google AI Overview on real physician queries. The trade press read it as a threat to the specialized tools. But OpenEvidence gives its answer away for free and earns its money selling physician attention to pharma, so a benchmark of answer quality measures the layer the company does not charge for. What a benchmark can grade is the answer text, which was always going to commoditize; what it cannot see is the habit, the ad inventory, and the trust that book the revenue. The exposure the paper actually points at is not a better model; it is that the free system it tied belongs to Google, the one competitor with the distribution to become the physician's new default.

*A benchmark can only grade the free sample*

By Thomas Jankowski, aided by AI

On June 12, Nature Medicine ran the benchmark nobody selling clinical AI wanted anyone to run. GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 were put up against the two flagship specialized tools, OpenEvidence and UpToDate's Expert AI, across three medical benchmarks. The general models won all three. And on the benchmark built from real physician queries, OpenEvidence, the twelve-billion-dollar company doctors call the Google for doctors, landed level with Google's free AI Overview.

Level with the free one.

The trade press filed it as a threat, and on its face it reads like one. A health system pays for OpenEvidence on the theory that medical tuning produces answers a general chatbot cannot match. A peer-reviewed table just reported that the tuning buys nothing the frontier models do not already have, and on real clinical questions nothing a free consumer feature does not already have. If the medical answer was the product, the product now ships in the search box at a price of zero.

Except OpenEvidence does not sell the answer. It gives the answer away. Any clinician with a national provider ID uses it for free, and the bill goes to pharma. The company sells advertising against clinical queries at CPMs that reportedly run from seventy dollars into four figures, against five to fifteen for a social feed, and it clears somewhere around a hundred and twenty dollars a year per physician. Its last reported revenue cleared a hundred million, almost all of it from that inventory. Read the income statement and OpenEvidence looks less like a medical-answer company that happens to be free than like an advertising business whose product is a doctor's attention, with the answer as the loss leader it runs to rent that attention.
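To put the CPM gap in concrete terms, here is a rough back-of-the-envelope sketch. It uses only the figures quoted above, treats the whole hundred and twenty dollars as ad revenue (the essay says "almost all"), and assumes $1,000 as the bottom of "four figures." The function name and scenario labels are made up for illustration.

```python
def impressions_needed(revenue_per_user: float, cpm: float) -> float:
    """Ad impressions needed to earn revenue_per_user at a given CPM.

    CPM is the price per 1,000 impressions, so
    impressions = revenue * 1000 / CPM.
    """
    if cpm <= 0:
        raise ValueError("cpm must be positive")
    return revenue_per_user * 1000 / cpm


# Rough per-physician annual figure from the essay, in dollars.
REVENUE_PER_PHYSICIAN = 120.0

# CPM scenarios; the 1000.0 value is an assumed floor for "four figures".
CPM_SCENARIOS = {
    "clinical query, low end": 70.0,
    "clinical query, four figures (assumed)": 1000.0,
    "social feed, low end": 5.0,
    "social feed, high end": 15.0,
}

for label, cpm in CPM_SCENARIOS.items():
    n = impressions_needed(REVENUE_PER_PHYSICIAN, cpm)
    print(f"{label}: about {n:,.0f} impressions per physician per year")
```

On these numbers, the same $120 takes roughly 1,700 impressions a year at a $70 CPM and only 120 at $1,000, against somewhere between 8,000 and 24,000 at social-feed rates.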

That distinction decides what the Nature paper could actually price.

A benchmark needs a gradeable output. Feed every system the same question, score the response, rank the field. The procedure only functions on the one layer where the products are comparable, the answer text, and it is silent on the layer that books the revenue: whether a physician already opens the tab out of habit, whether a pharma brand will pay a four-figure CPM to sit beside an oncology query, whether a doctor trusts the source enough to act on it in front of a patient. None of that is visible to a benchmark at all. So any head-to-head run across clinical AI is built to find convergence, because it grades the one component that was always going to converge and never touches the components that were never in the model.
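The shape of that procedure can be sketched in a few lines. This is a toy, not the paper's method: the grader is a made-up keyword check, and the systems, question, and terms are placeholders. The point is only what the function signatures admit. Answer text goes in; habit, ad inventory, and trust have no parameter to arrive through.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type for illustration: a "system" maps a question to answer text.
System = Callable[[str], str]


def grade(answer: str, reference_terms: List[str]) -> float:
    """Toy grader: fraction of reference terms found in the answer text."""
    text = answer.lower()
    hits = sum(1 for term in reference_terms if term.lower() in text)
    return hits / len(reference_terms)


def run_benchmark(
    systems: Dict[str, System],
    items: List[Tuple[str, List[str]]],
) -> List[Tuple[str, float]]:
    """Feed every system the same questions, score each response, rank by mean."""
    means = {}
    for name, system in systems.items():
        scores = [grade(system(question), terms) for question, terms in items]
        means[name] = sum(scores) / len(scores)
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)


# Two placeholder systems that happen to return the same answer text.
def system_a(question: str) -> str:
    return "placeholder answer mentioning alpha and beta"


def system_b(question: str) -> str:
    return "placeholder answer mentioning alpha and beta"


items = [("placeholder question", ["alpha", "beta"])]
ranking = run_benchmark({"system_a": system_a, "system_b": system_b}, items)
print(ranking)
```

Two systems that emit the same text tie by construction, whatever separates them outside the answer box.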

Which makes the finding precise and narrow at once. It established, in print, with a control arm, that the answer layer has commoditized. For a company that charged for answers, that is the whole game. OpenEvidence charges for the audience that shows up to read them, and the audience did not appear in the table.

None of which makes the company safe. The paper does point at a real exposure, just not the one that got written up. A doctor is never going to paste a case into a raw model and grade it against OpenEvidence, so GPT scoring higher changes nothing about anyone's Tuesday. The dangerous word is "tied," and the name attached to it. The system that drew level on real physician queries was Google's, and Google is the only competitor that can drop a free clinical answer into a surface the doctor already lives in, inside Search, inside Workspace, inside the phone, with no new tab to learn. What protects OpenEvidence is the default: roughly forty percent of American physicians already start there. Defaults erode from one direction: a rival with enough distribution to become the new default before the user has decided anything. Google just benchmarked its way to parity on the exact metric a doctor would use to justify the switch.

This is the same floor the frontier labs are already watching form under their own product: once a capable answer is free at the point of care, fine-tuning cannot lift it back above zero. The benchmark is the first time that floor showed up in medicine with a citation attached.

So the accurate read of the paper is smaller and more interesting than "specialized clinical AI lost." It measured one layer, found it flat, and that layer was the free sample rather than the business. The number that has to be defended is the twelve billion, and twelve billion was never priced on owning the best medical model. It was priced on owning the physician's first query of the day. The Nature paper says nothing about that first query. The company sitting level with OpenEvidence in the results, the one that gives its own answer away and already reaches every doctor, says everything about who could take it.

—TJ