Microsoft AI CEO Mustafa Suleyman: Why Our AI Diagnostician Outperforms Doctors

As AI models get commoditized, the value will be added in that final layer of orchestration, Suleyman says.
Microsoft this week announced it built an AI diagnostician that outperforms human doctors on complex cases. The system, called MAI-DxO, uses two bots to sort through a patient's medical history and solves 85.5% of patient cases when paired with OpenAI's o3 model. The results are a major leap above the 20% average accuracy that human doctors achieved on the same cases, although the humans were restricted from searching the web or speaking with colleagues.

In an in-depth conversation shortly after Microsoft announced the results, Microsoft AI CEO Mustafa Suleyman shared how the AI diagnostician was able to 4x the performance of human doctors, what it means for the future of medicine, and whether this is a positive trend for society. You can read our full conversation below, edited lightly for length and clarity.

Alex Kantrowitz: Hi Mustafa, good to see you again. First off, Copilot and Bing now field 50 million medical queries per day. Is that good?

Mustafa Suleyman: It's incredible, because we're already making access to information super cheap and concise with just search engines. And now with Copilot, answers are much more conversational. You can tone them down so they suit your specific level of knowledge and expertise, and as a result, more and more people are asking Copilot and Bing health-related questions. The queries range from a cancer issue that someone's dealing with, to a death in the family, to a mental health issue, to just having a skin rash. So the variety is huge, but obviously we've got a really important objective here to try and improve the quality of our consumer health products.

Do the health questions that come into chatbots look different from search?

Copilot's answers tend to be more succinct and responsive to the style and tone of the individual person asking the question, and that tends to encourage people to ask a second follow-up question.
So it turns into more of a dialog, or a consultation that you might end up having with your doctor. So they are quite different to a normal search query.

Speaking of dialogs, let's discuss Microsoft's new AI diagnostician system. It's actually two bots, where one bot acts as a gatekeeper to all of a patient's medical information, and the other asks questions about that history and makes a diagnosis. You've found the system performs better than humans in diagnosing disease.

That's exactly right. We essentially wanted to simulate what it would be like for an AI to act as a diagnostician: to ask the patient a series of questions, to draw out their case history, go through a whole bunch of tests that they may have had — pathology and radiology — and then iteratively examine the information that it's getting, in order to improve the accuracy and reliability of its prediction about what your diagnosis actually is.

We use the New England Journal of Medicine case histories, hundreds of these past cases. One of these cases comes out every single week, and it's like an ultimate crossword for doctors. They don't see the answer until the following week. And it's a big guessing game to go back through five to seven pages of very detailed history, and then try to figure out what the diagnosis actually turns out to be.

I thought one of the benefits of generative AI is that it can take in a lot of information and then come to answers — often in one shot. What's the benefit of having multiple bots sort through it?

The big breakthrough of the last six months or so in AI is these thinking or reasoning models that can query other agents, or find other information sources at inference time, to improve the quality of their responses. Rather than just giving the first best answer, the model instead goes and consults a range of different sources, and that improves the quality of the information that it finally gets to.
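The two-bot setup described above — one agent guarding the full case record, the other drawing it out through questions before committing to a diagnosis — can be sketched in miniature. Everything here is an illustrative assumption: the class names, the toy case data, and the keyword-matching logic stand in for the real models, and none of it reflects Microsoft's actual MAI-DxO implementation.

```python
# Illustrative sketch, not MAI-DxO: a "gatekeeper" holds the full patient
# record and reveals only what is asked for; a "diagnostician" queries it
# iteratively, accumulating findings before proposing a diagnosis.

class Gatekeeper:
    """Holds a patient's full record; answers only direct questions."""

    def __init__(self, case_record: dict[str, str]):
        self._record = case_record

    def answer(self, question: str) -> str:
        # Reveal only the record fields the question explicitly mentions.
        hits = [v for k, v in self._record.items() if k in question.lower()]
        return "; ".join(hits) if hits else "No information available."


class Diagnostician:
    """Asks questions one at a time, then proposes a diagnosis."""

    def __init__(self, questions: list[str]):
        self.questions = questions
        self.findings: list[str] = []

    def consult(self, gatekeeper: Gatekeeper) -> str:
        for q in self.questions:
            reply = gatekeeper.answer(q)
            if reply != "No information available.":
                self.findings.append(reply)
        # Toy stand-in for the model's reasoning over accumulated findings.
        if any("glucose" in f for f in self.findings):
            return "suspected diabetes"
        return "undetermined"


# Toy walk-through with a fabricated case record.
gk = Gatekeeper({
    "history": "patient reports fatigue and increased thirst",
    "labs": "fasting glucose 182 mg/dL",
})
dx = Diagnostician(["What is the history?", "What do the labs show?"])
print(dx.consult(gk))  # suspected diabetes
```

The point of the split is that the diagnostician never sees the whole record at once; like a real clinician, it only learns what it thinks to ask about, which is what makes the sequential questioning meaningful.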
So we see that this orchestrator, which under the hood uses four different models from the major providers, can actually improve the accuracy of each of the individual models, and of all of them together, by a very significant degree, about 10% or so. So it's a big step forward. And I think that as the AI models get commoditized, all the value will be added in that final layer of orchestration and product integration, and that's what we're seeing with this diagnostic orchestrator.

So it's a 10% increase in accurately diagnosing on top of the standard LLMs?

Yes. And in fact, we actually benchmark that against human performance. So we had a whole bunch of expert physicians play this simulated diagnosis environment game, and they, on average, get about one in five right, so about 20%. Whereas our orchestrator gets about 85% accuracy, so it's four times more accurate. In my career, I've never seen such a big gulf between human-level performance and the AI system's performance.

Many years ago, I worked on lots of diagnoses for radiology and head and neck cancer and mammography, and the goal was just to take a single radiology exam and predict: does it have cancer? And that was the most we could do. Whereas now it's actually producing a very detailed diagnosis, and doing that sequentially through this interactive dialog mechanism. And that massively improves the accuracy.

What if you have the same thing happen to medicine as is happening with beginner-level code, where people learn to code using copilots, but when something breaks, it becomes harder for them to figure out what's going on? If you're a doctor and you outsource some of your thinking to these bots, is that a problem?...
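The claim above — that an orchestration layer over several commodity models can beat each model on its own — can be illustrated with a simple simulation. The standalone accuracies, the number of models, and the majority-vote rule below are all assumptions chosen for illustration; they are not MAI-DxO's actual aggregation method or its real numbers.

```python
# Minimal illustration of why orchestrating several models can beat each
# one individually: if models make errors independently, a majority vote
# cancels many of them out. Accuracies and the voting rule are assumed
# for illustration only.
import random

random.seed(0)

def simulate(accuracy: float, n_cases: int) -> list[bool]:
    """One model's per-case correctness at a given standalone accuracy."""
    return [random.random() < accuracy for _ in range(n_cases)]

def majority_vote(votes_per_model: list[list[bool]]) -> list[bool]:
    """A case counts as solved if most models got it right."""
    n_models = len(votes_per_model)
    return [
        sum(case_votes) > n_models / 2
        for case_votes in zip(*votes_per_model)
    ]

n_cases = 10_000
# Five hypothetical models with independent errors.
models = [simulate(acc, n_cases) for acc in (0.70, 0.72, 0.74, 0.76, 0.78)]
combined = majority_vote(models)

for i, m in enumerate(models):
    print(f"model {i}: {sum(m) / n_cases:.1%}")
print(f"ensemble: {sum(combined) / n_cases:.1%}")
```

Running this, the ensemble comes out well above the best single model, because a case is only lost when a majority of models miss it simultaneously. The real orchestrator is far more sophisticated than a vote, but the underlying intuition is the same: value accrues in the layer that combines and cross-checks the models.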
© 2025 Alex Kantrowitz