What happens when genAI vendors kill off their best sources?

If you think the latest generative AI (genAI) tools such as Google AI Overviews and OpenAI GPT-4o will change the world, you’re right. They will. But will they change it for the better? That’s another question.

I’ve been playing with both tools (and other genAI programs, as well). I’ve found they’re still prone to hallucinations, but sound more convincing than ever. That’s not a good thing.

One of the reasons I’m still making a living as a tech journalist is that I’m very good at discerning fact from fantasy. Part of that skill set comes from being an excellent researcher. The large language models (LLMs) that underpin genAI chatbots… not so much. Today, and for the foreseeable future, genAI at its best is really just very good at copying and pasting from the work of others.

That means the results they spit out are only as good as their sources. Look at it this way: if I want to know about the latest news, I go to The New York Times, the Washington Post, and the Wall Street Journal. Not only do I trust their reporters, but I know what their biases are. 

For example, I know I can believe what the Journal has to say about financial news, but I take their columnists with a huge grain of salt. (That’s just me; you might love them.)

As for the Times, remember it claims that OpenAI has stolen its stories to train ChatGPT. If it wins its case, genAI is in trouble, because other publishers will follow in quick succession. When that happens, all the genAI engines will have to steal (uhm, learn) their content from the likes of Reddit; your “private” Slack messages; and Stack Overflow, where users are sabotaging their answers to screw up OpenAI.

That’s not going to go well. There’s a reason genAI engines often spew garbage: it’s what they were trained on. For instance, 80% of the tokens used to train OpenAI’s GPT-3 came from Common Crawl. Like the name says, these petabytes of data are scraped from everywhere and anywhere on the web. As a Mozilla Foundation study found, the result is not trustworthy AI.

Worse still, this will eventually lead to a time when those genAI tools start consuming their own garbage. This is a known problem that will cause model collapse. Or, as neuroscientist Erik Hoel pithily describes the end result: “synthetic garbage.” He’s not alone; many AI engineers think a little bit of AI-generated data can poison their LLMs.

At the same time, genAI companies aren’t doing us — or themselves, in the long run — any favors. For example, Google’s AI-powered “Overviews” provides concise AI summaries at the top of search results. This move promises quicker access to information, and Google’s Liz Reid claims it will drive more clicks to websites by piquing users’ interest.

Reid, who oversees search operations, maintains that AI Overviews really will encourage more searches and clicks to websites as users seek to “dig deeper” after getting the initial synthesized summary.

Publishers know better. Who will bother to go to the real story, which might require a subscription or (horrors) seeing an ad?

Danielle Coffey, CEO of the News Media Alliance, which represents more than 2,200 publishers, warns that the change could be “catastrophic” for an industry already struggling with declining ad revenue. “It’s offensive and potentially unlawful for a dominant monopoly like Google to dictate the rules in a way that sacrifices the interests of publishers and creators,” she said.

Google has never been a friend to publishers. Just ask leaders in countries like Spain or Canada, where governments tried to get Google to pay publishers for access to their news sites.

If Google, Microsoft, and other genAI companies keep all those search visitors (and ad revenues) to themselves, as I expect will be the case, publications will die at an even faster rate. And there goes any authoritative information Google and the other AI services need for their LLMs. 

OpenAI’s co-founder, Sam Altman, recently said, “GPT-4 is the dumbest model any of you will ever have to use again by a lot” and that “GPT-5 is going to be a lot smarter.”

I’m sure it will be. GPT-4o is clearly superior to its predecessor and GPT-5 will continue the trend. But GPT-6 and beyond? Simple greed may ensure that, as reliable human-created stories disappear, AI will only get dumber and dumber.  

In short, we’re looking at a future filled with AI GIGO: Garbage In, Garbage Out. No one wants that. The time to stop it is now.