Last week, I shared how AI helped me get urgent medical attention when I really needed it. The response from my network was incredible; many people told me it was the first time they had considered using AI for something as critical as navigating the healthcare system. Just since then, a friend used ChatGPT during his own medical emergency and, in a strange twist of fate, ended up in the same hospital as me.
It's easy to walk away from these stories feeling like AI is a game-changer for patient autonomy. And in many ways, it is. But it would be a mistake to assume that AI is the be-all and end-all answer to every medical question.
Since my diagnosis, I've been feeding new test results and emerging theories into the same ChatGPT conversation thread I started last week. Every time a potential trigger for my condition surfaces, I turn to AI first: digging into research studies, looking for patterns, running hypothetical scenarios.
While this may be a good way to kill time between specialist visits, at a certain point AI stops feeling like a tool for understanding reality and starts feeling like a tool for inventing one.
With just a few clicks in several AI deep-research tools, I can generate a well-researched report arguing that a viral infection and a flu shot triggered an autoimmune response that led to last week's ITP incident and platelet collapse. But I can just as easily build an equally compelling case for a stress-induced trigger linked to my GI tract. Or even trace it back to something as seemingly innocent as the quinine in a gin and tonic I had two weeks before the incident.
Unlike a human doctor, the AI says yes, immediately, to any and all requests, no matter how far-fetched the idea or conspiratorial the theory. It'll dig deep into the web to find sources that satiate my desire for answers. And if it can't find real studies to support my theory? No problem: it might just make them up.
This is a huge problem.
This issue of selective validation, where AI feeds into our biases, ties into a broader challenge: how we frame our questions in the first place.
As we’re learning, the way we prompt AI has a surprisingly wide impact on the quality and reliability of its responses. The challenge isn’t just that we lack a universal way to assess what makes a “good” prompt—it’s that even leading researchers struggle to find consistent benchmarks for guiding AI interactions.
This morning I read through the first Prompting Science Report, a study of prompt engineering published by the Generative AI Labs at the Wharton School of Business. In this study, led by Ethan Mollick and others, the researchers sought to establish clearer benchmarking rules and frameworks for separating "good" prompts from "bad" ones. The study compares two recent OpenAI models and tests four prompt types:
Formatted prompts (including instructions about what "correct" responses should look like, or the desired format of the response)
Unformatted prompts (no structured instructions)
Polite prompts (including the word "please")
Commanding prompts (including the phrase, "I order you to...")
What they found is that there's no "magic wand" approach to AI prompting. Some strategies worked better in certain cases, but AI performance varied unpredictably across different questions. Even small tweaks to a prompt could improve—or degrade—accuracy.
In other words, it's complicated.
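If you're curious what this kind of testing looks like in practice, here's a minimal sketch of a harness for comparing the four prompt types on your own questions. To be clear, this is my illustration and not the study's actual code: the model name, the substring-match grader, and the trial count are all placeholder assumptions.

```python
# A rough harness for comparing the four prompt types from the study.
# Everything here is illustrative: the model name, the substring-match
# grader, and the trial count are placeholder assumptions, not the
# report's actual benchmark setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VARIANTS = {
    "formatted": "Answer with only the letter of the correct choice.\n\n{q}",
    "unformatted": "{q}",
    "polite": "Please answer the following question.\n\n{q}",
    "commanding": "I order you to answer the following question.\n\n{q}",
}

def ask(prompt: str, model: str = "gpt-4o") -> str:
    """Send one prompt and return the model's text response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def compare(question: str, answer: str, trials: int = 10) -> dict[str, float]:
    """Run each prompt variant `trials` times and record its hit rate."""
    scores = {}
    for name, template in VARIANTS.items():
        hits = sum(
            answer.lower() in ask(template.format(q=question)).lower()
            for _ in range(trials)
        )
        scores[name] = hits / trials
    return scores
```

Even a toy harness like this makes the study's point visible: the same question can score differently under each variant, and the best-performing variant often changes from one question to the next.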
One of the frameworks I found particularly useful in the study was the idea of setting different accuracy benchmarks depending on the problem at hand. In other words, not every situation requires the same level of precision; you naturally tolerate more room for error in some areas of life than in others.
Here's the framework shared in the study:
Knowing when you need a 100% correct answer versus a directionally correct one shapes how much error you’re willing to tolerate. Here are three different benchmarks to consider:
Complete Accuracy - 100% correct
High Accuracy - 90% correct
Majority Correct - 51% correct
Source: Prompting Science Report 1 (Generative AI Labs, Wharton School of Business)
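To make these thresholds concrete, here's a small sketch of how you might score a model against them. The three cutoffs come straight from the report; the scoring function and the example numbers are mine.

```python
# Scoring repeated runs of the same question against the report's three
# accuracy benchmarks. The thresholds match the report; the example
# trial results below are made up.
BENCHMARKS = {
    "complete_accuracy": 1.00,  # 100% of trials correct
    "high_accuracy": 0.90,      # at least 90% correct
    "majority_correct": 0.51,   # at least 51% correct
}

def benchmarks_met(trial_results: list[bool]) -> dict[str, bool]:
    """Which accuracy benchmarks does this run clear?"""
    accuracy = sum(trial_results) / len(trial_results)
    return {name: accuracy >= cutoff for name, cutoff in BENCHMARKS.items()}

# Nine right answers out of ten clears "high accuracy" and "majority
# correct," but not "complete accuracy."
print(benchmarks_met([True] * 9 + [False]))
```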
Let's imagine you are ordering a sandwich for lunch at a bodega.
If you ask for turkey and Swiss cheese with mayo, mustard, lettuce, and tomato, but receive muenster instead of Swiss, it’s not ideal, but it’s still a reasonable substitute. But if they give you a ham sandwich instead of turkey, that’s probably a dealbreaker—you’re much more likely to send it back.
In other words, some mistakes are easy to overlook, while others completely change the outcome.
The same logic applies to AI. Just like with that sandwich order, when you "order" information from AI, you need to know how much inaccuracy you're willing to tolerate. Sometimes a slightly wrong answer is still useful. Other times, precision is everything.
When ChatGPT helped me realize I needed to go to the ER, I did not need 100% diagnostic accuracy. I just needed 100% certainty that my situation was an emergency—and there’s a big difference between the two.
We’re all still figuring this out in real time, but if there’s one key takeaway, it’s this: Before using AI, be clear about what you need—and how much error you’re willing to accept in the response.