Larger AI chatbots often give incorrect answers over admitting uncertainty, study shows


A study of recent, larger versions of three major AI chatbots finds that they are more likely to give an incorrect answer than to admit they do not know. The results, published in Nature on Wednesday (Sept. 25), also show that people often struggle to spot these errors.

ReadWrite has previously reported on how chatbots can “hallucinate” answers to queries. José Hernández-Orallo of the Valencian Research Institute for Artificial Intelligence in Spain, along with his colleagues, examined these misfires to understand how they evolve as AI models scale up: drawing on more training data, incorporating more parameters (decision-making nodes), and consuming greater computing power.

They also investigated whether the rate of errors matches human perceptions of question difficulty, and how well people can recognize incorrect answers.

Are AI LLMs trustworthy?

The team found that larger, more refined versions of large language models (LLMs) are more accurate, largely thanks to fine-tuning methods such as reinforcement learning from human feedback. However, they are also less reliable: among responses that are not accurate, the proportion of outright wrong answers has risen, because these models are now less likely to avoid answering a question, for example by admitting they don’t know or by changing the subject.
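
To make the metric concrete, here is a minimal Python sketch (not taken from the paper; the labels and numbers are hypothetical) that classifies each model response as correct, incorrect, or avoidant and computes the share of wrong answers among the responses that are not correct:

```python
from collections import Counter

# Hypothetical labels for a batch of model responses:
# "correct"   - the answer matches the reference
# "incorrect" - the model answered but got it wrong
# "avoidant"  - the model declined, hedged, or changed the subject
responses = [
    "correct", "incorrect", "incorrect", "avoidant",
    "correct", "incorrect", "avoidant", "incorrect",
]

counts = Counter(responses)
non_correct = counts["incorrect"] + counts["avoidant"]

# Accuracy: share of all responses that are correct
accuracy = counts["correct"] / len(responses)

# The reliability concern described above: among responses that are NOT
# correct, how many are outright wrong rather than a safe "I don't know"?
wrong_share = counts["incorrect"] / non_correct if non_correct else 0.0

print(f"accuracy: {accuracy:.0%}")                    # 25% in this toy batch
print(f"wrong among non-correct: {wrong_share:.0%}")  # 67% in this toy batch
```

In this toy batch, accuracy looks modest but the more telling figure is that two-thirds of the non-correct responses are confident wrong answers rather than avoidances, which is the pattern the study attributes to newer, more refined models.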

One of the researchers, Lexin Zhou, wrote on X: “LLMs are indeed less correct on tasks that humans consider difficult, but they still do succeed at difficult tasks before being flawless on easy tasks, leading to no safe operation conditions humans can identify where LLMs can be trusted.”

1/ New paper @Nature!

Discrepancy between human expectations of task difficulty and LLM errors harms reliability. In 2022, Ilya Sutskever @ilyasut predicted: “perhaps over time that discrepancy will diminish” (https://t.co/HADDUztzhu, min 61-64).

We show this is *not* the case!

— Lexin Zhou (@lexin_zhou) September 25, 2024

He added that it was “concerning” that the latest LLMs improve mainly on “high-difficulty instances,” exacerbating the mismatch between human expectations of difficulty and LLM success.

The team evaluated OpenAI’s GPT, Meta’s LLaMA, and BLOOM, testing both early and refined versions of each on prompts covering arithmetic, geography, and information transformation. They found that accuracy improved with model size but fell on more challenging questions.

Models, including GPT-4, often attempted difficult questions rather than declining them, and wrong answers exceeded 60 percent for some refined models. Surprisingly, even easy questions were sometimes answered incorrectly. Volunteers misclassified inaccurate answers as correct 10 to 40 percent of the time, highlighting how difficult it is for people to supervise these models.
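
As a rough illustration of how such results can be tabulated (a sketch with made-up numbers, not the study’s data), the snippet below groups hypothetical responses by a human difficulty rating and reports accuracy, wrong-answer rate, and avoidance rate per bucket:

```python
from collections import defaultdict

# Hypothetical (difficulty, outcome) pairs, where difficulty is a human
# rating ("easy" or "hard") and outcome is correct/incorrect/avoidant.
results = [
    ("easy", "correct"), ("easy", "correct"), ("easy", "incorrect"),
    ("hard", "incorrect"), ("hard", "incorrect"), ("hard", "correct"),
    ("hard", "avoidant"), ("hard", "incorrect"),
]

buckets = defaultdict(lambda: {"correct": 0, "incorrect": 0, "avoidant": 0})
for difficulty, outcome in results:
    buckets[difficulty][outcome] += 1

# Report accuracy, wrong-answer rate, and avoidance rate per difficulty bin.
for difficulty, c in buckets.items():
    total = sum(c.values())
    print(
        f"{difficulty}: accuracy {c['correct'] / total:.0%}, "
        f"wrong {c['incorrect'] / total:.0%}, "
        f"avoided {c['avoidant'] / total:.0%}"
    )
```

A table like this is what lets researchers check whether there is any difficulty range where a model can be trusted outright; the study’s point is that errors appear even in the “easy” bucket.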

Hernández-Orallo suggests that developers should “boost AI performance on easy questions” and encourage chatbots to avoid answering difficult ones, allowing users to more accurately assess when AIs are reliable. He states, “We need humans to understand: ‘I can use it in this area, and I shouldn’t use it in that area’.”

Featured image: Ideogram

