Chatbots provided incorrect, conflicting medical advice, researchers found: “Despite all the hype, AI just isn’t ready to take on the role of the physician.”

“In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice,” the study’s authors wrote. “One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care.”

  • rumba@lemmy.zip · +13 · 2 hours ago

    Chatbots make terrible everything.

    But an LLM properly trained on sufficient patient data, metrics, and outcomes, in the hands of a decent doctor, can cut through bias, catch things that might fall through the cracks, and pack thousands of doctors’ worth of updated CME into a thing that can look at a case and go, you know, you might want to check for X. The right model can be fucking clutch at pointing out nearly invisible abnormalities on an X-ray.

    You can’t ask an LLM trained on general bullshit to help you diagnose anything. You’ll end up with 32,000 Reddit posts worth of incompetence.

    • SuspciousCarrot78@lemmy.world · +1 · 29 minutes ago

      Agree.

      I’m sorta kicking myself I didn’t sign up for Google’s Med-PaLM 2 when I had the chance. Last I checked, it scored 96% on the USMLE exam and 88% on radiology interpretation / report writing.

      I remember looking at the sign up and seeing it requested credit card details to verify identity (I didn’t have a google account at the time). I bounced… but gotta admit, it might have been fun to play with.

      Oh well; one door closes, another opens.

  • alzjim@lemmy.world · +13 · 5 hours ago

    Calling chatbots “terrible doctors” misses what actually makes a good GP — accessibility, consistency, pattern recognition, and prevention — not just physical exams. AI shines here — it’s available 24/7 🕒, never rushed or dismissive, asks structured follow-up questions, and reliably applies up-to-date guidelines without fatigue. It’s excellent at triage — spotting red flags early 🚩, monitoring symptoms over time, and knowing when to escalate to a human clinician — which is exactly where many real-world failures happen. AI shouldn’t replace hands-on care — and no serious advocate claims it should — but as a first-line GP focused on education, reassurance, and early detection, it can already reduce errors, widen access, and ease overloaded systems — which is a win for patients 💙 and doctors alike.

    /s

    • plyth@feddit.org · +2 · 1 hour ago

      The /s was needed for me. There are already more old people than the available doctors can handle. Instead of having nothing, what’s wrong with an AI baseline?

  • SuspciousCarrot78@lemmy.world · +6 · 23 minutes ago

    So, I can speak to this a little bit, as it touches two domains I’m involved in. TL;DR - LLMs bullshit and are unreliable, but there’s a way to use them in this domain as a force multiplier of sorts.

    In one, I’ve created a Python router that takes my (deidentified) clinical notes, extracts and compacts the input (per user-defined rules), creates a summary, then:

    1. benchmarks the summary against my (user-defined) gold standard and provides a management plan (again, based on a user-defined database).

    2. this is then dropped into my on-device LLM for light editing and polishing to condense, which I then eyeball, correct and escalate to my supervisor for review.

    Additionally, the LLM-generated note can be approved / denied by the Python router, in the first instance, based on certain policy criteria I’ve defined.

    It can also suggest probable DDx based on my databases (which are CSV-based).

    Finally, if the LLM output fails the policy check, the router tells me why it failed and just says “go look at the prior summary and edit it yourself”.

    This three-step process takes the paperwork from 15-20 mins of tedium down to 1 minute of generation plus 2 mins of manual editing, which is approx a 5-7x speed-up.

    The reason why this is interesting:

    All of this runs within the LLM session (it calls / invokes the Python tooling via a >> command) and is 100% deterministic; no LLM jazz until the final step, which the router can outright reject and which is user-auditable anyway.

    I’ve found that using a fairly “dumb” LLM (Qwen2.5-1.5B), with settings dialed down, produces consistently solid final notes (5 out of 6 are graded as a pass on the first run by the router invoking the policy document and checking the output). It’s too dumb to jazz, which is useful in this instance.

    Would I trust the LLM, end to end? Well, I’d trust my system approx 80% of the time. I wouldn’t trust ChatGPT … even though it’s been more right than wrong in similar tests.
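
    Roughly, the shape of it looks like the sketch below. To be clear, this is a simplified illustration of the pattern (deterministic extraction and a policy gate wrapped around a single LLM polishing step); the function names, field rules, policy criteria and CSV columns are placeholders, not my actual config or the real project code.

    ```python
    # Simplified sketch of the router pattern: deterministic extraction plus a
    # policy gate around one LLM polishing step. All names/rules are placeholders.
    import csv
    import re

    REQUIRED_SECTIONS = ("presenting complaint", "findings", "plan")  # example policy
    MAX_WORDS = 180                                                   # example policy

    def extract_summary(note, rules):
        """Deterministically pull user-defined fields out of a deidentified note.

        rules maps a field name to a regex with one capture group.
        """
        out = {}
        for field, pattern in rules.items():
            m = re.search(pattern, note, flags=re.IGNORECASE | re.DOTALL)
            out[field] = m.group(1).strip() if m else "NOT FOUND"
        return out

    def suggest_ddx(summary, ddx_csv):
        """Look up probable differentials from a plain CSV keyed on keywords."""
        hits = []
        with open(ddx_csv, newline="") as f:
            for row in csv.DictReader(f):          # columns: keyword, differential
                if row["keyword"].lower() in summary.get("findings", "").lower():
                    hits.append(row["differential"])
        return hits

    def policy_check(llm_note):
        """Approve or reject the LLM-polished note against fixed criteria."""
        missing = [s for s in REQUIRED_SECTIONS if s not in llm_note.lower()]
        if missing:
            return False, "missing sections: " + ", ".join(missing)
        if len(llm_note.split()) > MAX_WORDS:
            return False, "note too long"
        return True, "ok"

    def route(note, rules, polish):
        """polish() is the only LLM call; everything else is deterministic."""
        summary = extract_summary(note, rules)
        draft = "\n".join(f"{k}: {v}" for k, v in summary.items())
        polished = polish(draft)
        ok, reason = policy_check(polished)
        if not ok:
            # Router rejects the LLM output and falls back to the plain summary.
            return draft, f"rejected: {reason} - edit the prior summary yourself"
        return polished, "approved"
    ```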

      • SuspciousCarrot78@lemmy.world · +2 · 45 minutes ago

        Depends which bit you mean specifically.

        The “router” side is an offshoot of a personal project. It’s Python scripting and a few other tricks, such as JSON files, etc. Full project details for that are here:

        https://github.com/BobbyLLM/llama-conductor

        The tech stack itself (a rough sketch of how it wires together follows the list):

        • llama.cpp
        • Qwen 2.5-1.5 GGUF base (by memory, 5 bit quant from HF Alibaba repository)
        • The python router (more sophisticated version of above)
        • Policy documents
        • Front end (OWUI - may migrate to something simpler / more robust. Occasional streaming disconnect issues at moment. Annoying but not terminal)
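
        For what it’s worth, here’s a rough sketch of how the pieces talk to each other; the model filename, port, prompt and parameters below are placeholders rather than my actual config. llama.cpp serves the quantised Qwen model locally, and the Python router hits its OpenAI-compatible endpoint for the single polishing step.

        ```python
        # 1) Serve the quantised model locally with llama.cpp's built-in server
        #    (filename and port are placeholders):
        #
        #    llama-server -m qwen2.5-1.5b-instruct-q5_k_m.gguf --port 8080 -c 4096
        #
        # 2) The router then calls it like any OpenAI-compatible endpoint:
        import requests

        def polish(draft: str) -> str:
            """Single low-temperature editing pass over the deterministic summary."""
            resp = requests.post(
                "http://127.0.0.1:8080/v1/chat/completions",
                json={
                    "messages": [
                        {"role": "system",
                         "content": "Tidy this clinical summary. Do not add new facts."},
                        {"role": "user", "content": draft},
                    ],
                    "temperature": 0.2,   # settings dialed down
                    "max_tokens": 400,
                },
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        ```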
        • realitista@lemmus.org · +1 · 14 minutes ago

          Thanks, it’s really interesting to see some real-world applications and implementations of AI for practical workloads.

          • SuspciousCarrot78@lemmy.world · +1 · 4 minutes ago

            Very welcome :)

            As it usually goes with these things, I built it for myself then realised it might have actual broader utility. We shall see!

  • pleksi@sopuli.xyz · +5 · 5 hours ago

    As a physician, I’ve used AI to check if I have missed anything in my train of thought. Never really changed my decision though. It has been useful to gather up relevant citations for my presentations as well. But that’s about it. It’s truly shite at interpreting scientific research data on its own, for example. Most of the time it will parrot the conclusions of the authors.

  • BeigeAgenda@lemmy.ca · +51 · 15 hours ago

    Anyone who has knowledge about a specific subject says the same: LLMs are constantly incorrect and hallucinate.

    Everyone else thinks it looks right.

    • tyler@programming.dev · +6 · 3 hours ago

      That’s not what the study showed though. The LLMs were right over 98% of the time… when given the full situation by a “doctor”. The problem was normal people trying to self-diagnose without knowing which details were important.

      Hence why studies are incredibly important. Even with the text of the study right in front of you, you assumed something that the study didn’t actually conclude.

    • agentTeiko@piefed.social · +5 · 10 hours ago

      Yep, it’s why C-levels think it’s the Holy Grail. Everything that comes out of their mouths is bullshit as well, so they don’t see the difference.

    • IratePirate@feddit.org · +28 · 14 hours ago

      A talk on LLMs I was listening to recently put it this way:

      If we hear the words of a five-year-old, we assume the knowledge of a five-year-old behind those words, and treat the content with due suspicion.

      We’re not adapted to something with the “mind” of a five-year-old speaking to us in the words of a fifty-year-old, and thus are more likely to assume competence just based on language.

      • leftzero@lemmy.dbzer0.com · +10 −1 · 9 hours ago

        LLMs don’t have the mind of a five year old, though.

        They don’t have a mind at all.

        They simply string words together according to statistical likelihood, without having any notion of what the words mean, or what words or meaning are; they don’t have any mechanism with which to have a notion.

        They aren’t any more intelligent than old Markov chains (or than your average rock), they’re simply better at producing random text that looks like it could have been written by a human.

        • plyth@feddit.org · +1 · 1 hour ago

          They simply string words together according to statistical likelihood, without having any notion of what the words mean

          What gives you the confidence that you don’t do the same?

        • IratePirate@feddit.org · +4 −1 · 6 hours ago

          I am aware of that, hence the scare quotes around “mind”. But you’re correct, that’s where the analogy breaks down. Personally, I prefer to liken them to parrots, mindlessly reciting patterns they’ve found in somebody else’s speech.

    • zewm@lemmy.world · +7 −3 · 14 hours ago

      It is insane to me how anyone can trust LLMs when their information is incorrect 90% of the time.

    • rudyharrelson@lemmy.radio · +107 −2 · 16 hours ago

      People always say this on stories about “obvious” findings, but it’s important to have verifiable studies to cite in arguments for policy, law, etc. It’s kinda sad that it’s needed, but formal investigations are a big step up from just saying, “I’m pretty sure this technology is bullshit.”

      I don’t need a formal study to tell me that drinking 12 cans of soda a day is bad for my health. But a study that’s been replicated by multiple independent groups makes it way easier to argue to a committee.

      • irate944@piefed.social · +34 −1 · 16 hours ago

        Yeah you’re right, I was just making a joke.

        But it does create some silly situations like you said

          • IratePirate@feddit.org · +8 −1 · 14 hours ago

            A critical, yet respectful and understanding exchange between two individuals on the interwebz? Boy, maybe not all is lost…

      • Knot@lemmy.zip · +19 · 15 hours ago

        I get that this thread started from a joke, but I think it’s also important to note that no matter how obvious some things may seem to some people, the exact opposite will seem obvious to many others. Without evidence, like this study, both groups are really just stating their opinions.

        It’s also why the formal investigations are required. And whenever policies and laws are made based on verifiable studies rather than people’s hunches, it’s not sad, it’s a good thing!

      • BillyClark@piefed.social · +7 · 15 hours ago

        it’s important to have verifiable studies to cite in arguments for policy, law, etc.

        It’s also important to have for its own merit. Sometimes, people have strong intuitions about “obvious” things, and they’re completely wrong. Without science studying things, it’s “obvious” that the sun goes around the Earth, for example.

        I don’t need a formal study to tell me that drinking 12 cans of soda a day is bad for my health.

        Without those studies, you cannot know whether it’s bad for your health. You can assume it’s bad for your health. You can believe it’s bad for your health. But you cannot know. These aren’t bad assumptions or harmful beliefs, by the way. But the thing is, you simply cannot know without testing.

        • Slashme@lemmy.world · +4 · 5 hours ago

          Or how bad something is. “I don’t need a scientific study to tell me that looking at my phone before bed will make me sleep badly”, but the studies actually show that the effect is statistically robust but small.

          In the same way, studies like this can make the distinction between different levels of advice and warning.

      • Telorand@reddthat.com · +9 −1 · 16 hours ago

        The thing that frustrates me about these studies is that they all continue to come to the same conclusions. AI has already been studied in mental health settings, and it’s always performed horribly (except for very specific uses with professional oversight and intervention).

        I agree that the studies are necessary to inform policy, but at what point are lawmakers going to actually lay down the law and say, “AI clearly doesn’t belong here until you can prove otherwise”? It feels like they’re hemming and hawing in the vain hope that it will live up to the hype.

      • Eager Eagle@lemmy.world · +4 · 16 hours ago

        Also, it’s useful to know how, when, or why something happens. I can make a useless chatbot that is “right” most times if it only tells people to seek medical help.

    • hansolo@lemmy.today · +1 −1 · 15 hours ago

      I’m going to start telling people I’m getting a Master’s degree in showing how AI is bullshit. Then I point out some AI slop and mumble about crushing student loan debt.

    • sbbq@lemmy.zip · +7 −3 · 11 hours ago

      My dad always said, you know what they call the guy who graduated last in his class at med school? Doctor.

  • Sterile_Technique@lemmy.world · +17 −1 · 14 hours ago

    Chipmunks, 5-year-olds, salt/pepper shakers, and paint thinner also all make terrible doctors.

    Follow me for more studies on ‘shit you already know because it’s self-evident immediately upon observation’.

    • scarabic@lemmy.world · +3 · 11 hours ago

      It’s actually interesting. They found the LLMs gave the correct diagnosis high-90-something percent of the time if they had access to the notes doctors wrote about their symptoms. But when thrust into the room, cold, with patients, the LLMs couldn’t gather that symptom info themselves.

  • GnuLinuxDude@lemmy.ml · +13 · 16 hours ago

    If you want to read an article that’s optimistic about AI and healthcare, but that falls apart if you start asking too many questions, try this one:

    https://text.npr.org/2026/01/30/nx-s1-5693219/

    Because it’s clear that people are starting to use it and many times the successful outcome is it just tells you to see a doctor. And doctors are beginning to use it, but they should have the professional expertise to understand and evaluate the output. And we already know that LLMs can spout bullshit.

    For the purposes of using and relying on it, I don’t see how it is very different from gambling. You keep pulling the lever, oh excuse me I mean prompting, until you get the outcome you want.

    • MinnesotaGoddam@lemmy.world · +2 −1 · 10 hours ago

      The one time my doctor used it and I didn’t get mad at them (they did the Google and said “the AI says”, and I started making angry Nottingham noises even though all the AI did was confirm exactly what we had just been discussing)… uh, well, that’s pretty much it. I’m not sure where my parens are supposed to open and close on that story.

      • GnuLinuxDude@lemmy.ml · +7 · 8 hours ago

        Be glad it was merely that and not something like this https://www.reuters.com/investigations/ai-enters-operating-room-reports-arise-botched-surgeries-misidentified-body-2026-02-09/

        In 2021, a unit of healthcare giant Johnson & Johnson announced “a leap forward”: It had added artificial intelligence to a medical device used to treat chronic sinusitis, an inflammation of the sinuses…

        At least 10 people were injured between late 2021 and November 2025, according to the reports. Most allegedly involved errors in which the TruDi Navigation System misinformed surgeons about the location of their instruments while they were using them inside patients’ heads during operations.

        Cerebrospinal fluid reportedly leaked from one patient’s nose. In another reported case, a surgeon mistakenly punctured the base of a patient’s skull. In two other cases, patients each allegedly suffered strokes after a major artery was accidentally injured.

        FDA device reports may be incomplete and aren’t intended to determine causes of medical mishaps, so it’s not clear what role AI may have played in these events. The two stroke victims each filed a lawsuit in Texas alleging that the TruDi system’s AI contributed to their injuries. “The product was arguably safer before integrating changes in the software to incorporate artificial intelligence than after the software modifications were implemented,” one of the suits alleges.