• SocialMediaRefugee@lemmy.ml · 2 days ago

    wtf, I checked the claim about states with an R in their name, and it was pretty ridiculous how badly it answered. It reads like someone talking through their thoughts in front of you.

    ====================================================

    list all of the US states with R in their name

    ChatGPT said: Here’s the full list of U.S. states that have the letter R in their name:

    Arizona

    Arkansas

    California

    Colorado

    Delaware

    Florida

    Georgia

    Illinois

    Maryland

    Massachusetts

    Missouri

    Nebraska

    North Carolina

    North Dakota

    Oregon

    South Carolina

    Texas – (wait… nope, no R here! ❌)

    Vermont

    Virginia

    Washington – (nope, no R either ❌)

    West Virginia

    Wisconsin – (no R ❌)

    Wyoming – (no R ❌)

    Let me clean that up so only correct ones remain:

    Final Correct List (22 states):

    Arizona

    Arkansas

    California

    Colorado

    Delaware

    Florida

    Georgia

    Maryland

    Missouri

    Nebraska

    North Carolina

    North Dakota

    Oregon

    Vermont

    Virginia

    West Virginia

    New Hampshire

    New Jersey

    New York

    Massachusetts

    Connecticut – (wait, no R ❌)

    Minnesota – (no R ❌)

    Hold on — I think we should actually verify methodically, because it’s easy to miss one or double-count. Do you want me to go alphabetically through all 50 states and pick only the ones with “R” so we’re 100% sure? That way we won’t miss any.

    • SocialMediaRefugee@lemmy.ml · 2 days ago

      I tried this with MS Copilot and got this:

      ===========================================

      list all of the US states with R in their name

      Sure! Here’s a list of all U.S. states that contain the letter “R” in their name:

      🗺️ U.S. States with “R” in the Name

      • Arizona
      • Arkansas
      • California
      • Colorado
      • Delaware
      • Florida
      • Georgia
      • Maryland
      • Massachusetts
      • Minnesota
      • Missouri
      • Nebraska
      • New Hampshire
      • New Jersey
      • New York
      • North Carolina
      • North Dakota
      • Oregon
      • South Carolina
      • Texas
      • Vermont
      • Virginia
      • Washington
      • West Virginia
      • Wisconsin
      • Wyoming

      Let me know if you want this grouped by region or visualized in a map!
    • Eager Eagle@lemmy.world · 1 day ago

      LLMs fundamentally can’t answer this kind of question, so it’s an unfair test. They don’t see letters; they only see tokens.

      It’s a bit like asking a random person on the street the square root of 75. Most will just guess on the spot.
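The token point can be illustrated with a toy greedy tokenizer over a made-up vocabulary (the merges below are hypothetical, not any real model’s): once “Arkansas” becomes the pieces “Ark” + “ansas”, the question “does it contain an R?” is no longer directly visible to the model, which only sees the IDs of those pieces.

```python
# Toy illustration only: the vocabulary below is invented, not a real
# model's merge table. The point is that an LLM receives token IDs for
# multi-character pieces, not individual letters.
VOCAB = ["Ark", "ansas", "Wis", "consin"]  # hypothetical merges

def greedy_tokenize(word, vocab):
    """Greedy longest-match tokenization over a toy vocabulary."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(greedy_tokenize("Arkansas", VOCAB))   # ['Ark', 'ansas']
print(greedy_tokenize("Wisconsin", VOCAB))  # ['Wis', 'consin']
```

The model would see something like `[1842, 20837]`, so counting R’s means recalling facts about those tokens, not reading the letters.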

      • Vendetta9076@sh.itjust.works · 20 hours ago

        I think this kind of stance is really interesting because of course you’re right, but it doesn’t matter. It doesn’t actually matter what an LLM can do; it matters what the corpos tell us they can do. And according to the corpos, this is child’s play.

      • jj4211@lemmy.world · 19 hours ago

        Most will just guess on the spot.

        Well no, most would say “I don’t know”, which an LLM is unlikely to do unless the training material shows that the consistent answer is “I don’t know”. It will give a fact-shaped answer that may be wrong, but at least it’s ‘shaped’ right.

        Again, the big problem is not that LLMs are useless because they can’t do these little specific tricks; it’s that it is important to keep calibrating expectations, particularly now that people have bet trillions on these things, and that’s a lot of money motivating people to lie and cheat their way into making everyone overestimate them. Without a counter-perspective, I think 90% of my coworkers would have been laid off already as the executives soak in unchallenged marketing bullshit from the big players.

        • jj4211@lemmy.world · 14 hours ago

          A bit off topic, but that’s pretty much a result of “prompt stuffing”. Your prompt is processed into a good old-fashioned search query, and the search results are added to the prompt. From the LLM’s perspective, it looks like a request to rework the source material in a manner consistent with your prompt. The LLM is fed the correct answer, so it doesn’t have to answer; it just has to reword the input.
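A minimal sketch of that “prompt stuffing” flow, assuming a hypothetical `search()` backend (the function names and snippet text are invented for illustration, not any vendor’s actual pipeline):

```python
def search(query):
    # Hypothetical retriever: in a real system this would hit a
    # search backend and return relevant text snippets.
    return ["Rhode Island is the smallest US state by area."]

def build_stuffed_prompt(user_prompt):
    """Stuff retrieved snippets into the prompt the LLM actually sees."""
    snippets = search(user_prompt)
    return (
        "Use only the sources below to answer.\n\n"
        "Sources:\n"
        + "\n".join(f"- {s}" for s in snippets)
        + f"\n\nQuestion: {user_prompt}"
    )

print(build_stuffed_prompt("What is the smallest US state?"))
```

The model’s job then shifts from recalling the answer to rewording material that already contains it.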

    • Hotzilla@sopuli.xyz · 1 day ago

      GPT-5 without “thinking” mode got the answer wrong.

      GPT-5 with thinking answered:

      Here are the 21 US states with the letter “R” in their name:

      Arizona, Arkansas, California, Colorado, Delaware, Florida, Georgia, Maryland, Missouri, Nebraska, New Hampshire, New Jersey, New York, North Carolina, North Dakota, Oregon, Rhode Island, South Carolina, Vermont, Virginia, West Virginia.

      It wrote a script that verified the list while doing the “thinking” (feeding the hallucinations back to the LLM).
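For comparison, the kind of verification script a “thinking” model might generate is trivial once the problem becomes ordinary string code (this is a sketch of the idea, not the model’s actual script):

```python
# Exact letter check over the 50 state names: trivial as code,
# even though the same question trips up token-based models.
STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]

with_r = [s for s in STATES if "r" in s.lower()]
print(len(with_r))  # 21
print(with_r)
```

Running this confirms the 21-state answer quoted above, including easy-to-miss cases like New Hampshire and excluding traps like Texas and Washington.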

      • skuzz@discuss.tchncs.de · 20 hours ago

        “Thinking” mode is just sending wave upon wave of GPUs at the problem until the killbots hit their preset kill count. One could roughly simulate it by skipping thinking mode and just feeding the question and answer back to the LLM repeatedly until it eventually produces an answer that might be “right”. These companies have hit a technological wall with LLMs and will do anything to look like they still have forward momentum.

        • jj4211@lemmy.world · 20 hours ago

          Well, not quite, because they don’t have criteria for ‘right’.

          They do basically say ‘generate 10x more content than usual, then dispose of 90% of it’, and that surprisingly seems to improve results, but at no point is anything ‘grading’ the result.

          Some people have posted ‘chain of thought’ examples, and even when the output is largely ‘correct’, you may see a middle step utterly flubbed in a way that should have fouled the whole thing, yet the error is oddly isolated and doesn’t carry forward into the subsequent content, as it would in actual ‘reasoning’.
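The “generate more, discard most” idea can be sketched as self-consistency voting: sample many answers and keep the most frequent one, so agreement among samples, not grading against ground truth, picks the survivor. `sample_answer` is a hypothetical stand-in for one stochastic LLM completion:

```python
from collections import Counter
import random

def sample_answer(rng):
    # Hypothetical stand-in for one stochastic LLM completion:
    # a noisy answer distribution, mostly (but not always) "21".
    return rng.choice(["21", "21", "21", "22", "19"])

def self_consistent_answer(n=1000, seed=0):
    """Sample n answers and keep the most common one (majority vote)."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n))
    return votes.most_common(1)[0][0]

print(self_consistent_answer())
```

Note that nothing here checks whether “21” is actually correct; a consistently wrong model would vote itself into the same confident answer.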