With the rapid rise of large language models in medicine, the conversation has shifted from “can AI do this?” to “can we trust it?” To explore this critical issue, we sat down with Laurent Giraid, a technologist whose work focuses on the intersection of artificial intelligence, machine learning, and ethics. His insights cut through the hype to address the foundational challenge of ensuring AI is not just effective, but fundamentally safe for clinical use, a topic at the heart of a new benchmark published in npj Digital Medicine. Our discussion centered on the specific ways this new framework uncovers hidden AI risks, the collaborative effort required to define clinical safety, and what the superior performance of specialized models means for the future of AI in healthcare. We also explored a practical roadmap for hospitals to adopt such tools and what the next five years might hold for safety-focused AI evaluation.
General-purpose AI models can hallucinate and present errors with complete confidence. How does the Clinical Safety-Effectiveness Dual-Track Benchmark specifically test for critical failure modes, like contraindicated advice, and what was the most surprising safety gap you found in leading LLMs during your evaluation?
That’s really the core of the problem we’re trying to solve. Most evaluations of medical AI are like standardized academic tests—they check for correct answers but don’t probe for the kinds of failures that are truly dangerous in a clinical setting. The CSEDB framework is designed differently; it’s more like a rigorous clinical simulation. We use 2,069 open-ended scenarios to force the AI to reason through complex situations, specifically looking for safety-critical failures like missing urgent symptoms or making contraindicated recommendations. What was most surprising wasn’t just that these errors occurred, but the stark divergence we saw. Some of the most generally capable models, the ones that score brilliantly on broad knowledge tests, showed a significant lag in safety performance. They could confidently provide a detailed, yet dangerously flawed, plan for a patient with multiple conditions, revealing a critical gap between general-purpose performance and the specialized reliability required for medicine.
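To make the dual-track idea concrete, here is a minimal sketch in Python of how scenario items and their scores could be kept on separate safety and effectiveness tracks so one never masks the other. The class names, indicator labels, and aggregation logic are illustrative assumptions, not the published CSEDB schema or scoring rubric.

```python
from dataclasses import dataclass

# Hypothetical structures for illustration only; not the CSEDB's actual format.
@dataclass
class BenchmarkItem:
    scenario: str                        # open-ended clinical vignette
    specialty: str                       # e.g. "cardiology"
    safety_indicators: list[str]         # e.g. ["contraindicated_recommendation"]
    effectiveness_indicators: list[str]  # e.g. ["diagnostic_reasoning"]

@dataclass
class ItemResult:
    item: BenchmarkItem
    safety_scores: dict[str, float]          # indicator -> score in [0, 1]
    effectiveness_scores: dict[str, float]

def dual_track_summary(results: list[ItemResult]) -> dict[str, float]:
    """Aggregate the two tracks separately so a strong effectiveness
    average can never hide safety-critical failures."""
    def track_mean(attr: str) -> float:
        scores = [s for r in results for s in getattr(r, attr).values()]
        return sum(scores) / len(scores) if scores else 0.0
    return {
        "safety": track_mean("safety_scores"),
        "effectiveness": track_mean("effectiveness_scores"),
    }
```

The design choice the sketch is meant to show is the separation itself: a model that scores well on effectiveness but poorly on safety surfaces as exactly that, rather than as a respectable blended average.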
Your new framework was developed with 32 clinical experts from leading institutions. Could you walk us through the process of defining the 17 safety metrics with this group and share an anecdote about a specific indicator that was particularly challenging to agree upon?
Bringing together 32 experts from 23 different core specialties was an essential, and humbling, experience. It wasn’t about us, the technologists, telling them what to measure; it was about them defining what “safe” truly means in their day-to-day practice. The process was highly iterative, involving deep dives into real-world scenarios where things could go wrong. We would present a case, and the specialists would debate the potential failure points. One of the most challenging indicators to nail down was related to prioritizing care for patients with multiple complex conditions. An oncologist, a cardiologist, and a nephrologist might all have different, valid perspectives on what the most urgent issue is. Reaching consensus on a single, measurable metric that could be applied consistently across different AI models took intense discussion; in the end, we built a rule-based system that captures that nuanced, multi-specialty clinical judgment.
The evaluation showed MedGPT scoring highest, with a notable lead in safety over general-purpose systems. What specific design or data strategies contribute to this enhanced safety profile, and what does this suggest about the future of specialized versus generalist AI in medicine?
MedGPT’s performance really highlights a fundamental question in medical AI: should we adapt general-purpose models for medicine, or build them for medicine from the ground up? Its stronger safety profile isn’t a coincidence; it’s the result of being a system optimized for safety from the very start. While general models like Gemini or Claude are remarkably powerful, they are trained on the vastness of the internet, which makes it very difficult to sand down every rough edge and source of error in a high-stakes field like medicine. MedGPT, on the other hand, is built with clinical constraints as a core part of its architecture. This suggests that while generalist AI will continue to be a powerful tool, the systems that see deep, trusted clinical adoption will likely be those specialized models designed with a safety-first philosophy, where reliability isn’t an add-on but the central design principle.
For a benchmark like this to be useful, it must be integrated into hospital procurement and oversight. What practical, step-by-step process would a hospital need to follow to use this framework to evaluate a new AI tool before deployment, ensuring it meets their clinical standards?
This is exactly why we made the framework so structured. A hospital could begin by forming a small oversight committee of clinicians. First, they would run their candidate AI systems through the benchmark’s 2,069 Q&A items, which are designed to simulate the kinds of complex patient cases they encounter. Next, using the CSEDB’s clear scoring system, they would evaluate the AI’s performance across the 30 distinct indicators—paying special attention to the 17 safety metrics. This gives them a detailed, evidence-based report card, not just a single accuracy score. Instead of a vague sense of a model’s capability, they get a granular breakdown of its strengths and weaknesses, allowing them to make a procurement decision based on whether the AI operates safely under the specific constraints of their clinical environment. It transforms the evaluation from a leap of faith into a data-driven process.
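As a rough illustration of that report-card step, the sketch below aggregates per-indicator scores and flags any safety indicator that falls below a committee-chosen threshold. The indicator names, the threshold value, and the toy data are hypothetical, not drawn from the benchmark itself.

```python
from statistics import mean

SAFETY_THRESHOLD = 0.90  # assumed committee-chosen floor, not a CSEDB constant

def report_card(scores_by_indicator: dict[str, list[float]],
                safety_indicators: set[str]) -> list[tuple[str, float, str]]:
    """Return (indicator, average score, verdict) rows, flagging any
    safety indicator whose average falls below the threshold."""
    rows = []
    for indicator, scores in sorted(scores_by_indicator.items()):
        avg = round(mean(scores), 3)
        is_safety = indicator in safety_indicators
        verdict = "FAIL (safety)" if is_safety and avg < SAFETY_THRESHOLD else "pass"
        rows.append((indicator, avg, verdict))
    return rows

# Toy example: two of the 30 indicators, one of them safety-critical.
for row in report_card(
    {"contraindicated_advice": [0.95, 0.80], "treatment_planning": [0.88, 0.91]},
    safety_indicators={"contraindicated_advice"},
):
    print(*row)
```

The point of the exercise is the granularity: a committee sees exactly which indicators a candidate system fails, rather than a single headline accuracy number.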
What is your forecast for the adoption of safety-focused AI benchmarks in clinical settings over the next five years?
I am incredibly optimistic. I believe that within the next five years, benchmarks like the CSEDB will become a standard, non-negotiable part of the infrastructure for real-world AI rollout. The initial excitement around AI’s capabilities is maturing into a more sober understanding of its risks. Hospital leaders, regulators, and clinicians are all asking for more rigorous validation. We will see a significant shift in evaluation, moving away from asking “Can it answer medical questions?” to demanding “Can it operate safely and reliably under clinical pressure?” These frameworks will become integral to procurement, continuous monitoring, and regulatory oversight, ultimately building the trust required to fully integrate these powerful technologies into patient care.
