AI Has Created a Battle Over Web Crawling

Most people assume that generative AI will keep getting better and better; after all, that’s been the trend so far. And it may do so. But what some people don’t realize is that generative AI models are only as good as the ginormous data sets they’re trained on, and those data sets aren’t constructed from proprietary data owned by leading AI companies like OpenAI and Anthropic. Instead, they’re made up of public data that was created by all of us: anyone who’s ever written a blog post, posted a video, commented on a Reddit thread, or done basically anything else online.

A new report from the Data Provenance Initiative, a volunteer collective of AI researchers, shines a light on what’s happening with all that data. The report, “Consent in Crisis: The Rapid Decline of the AI Data Commons,” notes that a significant number of organizations that feel threatened by generative AI are taking measures to wall off their data. IEEE Spectrum spoke with Shayne Longpre, a lead researcher with the Data Provenance Initiative, about the report and its implications for AI companies.

Shayne Longpre on:

How websites keep out web crawlers, and why
Disappearing data and what it means for AI companies
Synthetic data, peak data, and what happens next

The technology that websites use to keep out web crawlers isn’t new; the robot exclusion protocol was introduced in 1994. Can you explain what it is and why it suddenly became so relevant in the age of generative AI?

Shayne Longpre: Robots.txt is a machine-readable file that crawlers—bots that navigate the web and record what they see—use to determine whether or not to crawl certain parts of a website. It became the de facto standard in the age where websites used it primarily for directing web search. So think of Bing or Google Search; they wanted to record this information so they could improve the experience of navigating users around the web. This was a very symbiotic relationship because web search operates by sending traffic to websites and websites want that. Generally speaking, most websites played well with most crawlers.
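
To make the mechanism concrete, here is a minimal sketch of the check a well-behaved crawler might run before fetching a page, using Python's standard-library `urllib.robotparser`. The robots.txt content, the crawler name `ExampleBot`, and the example.com URLs are hypothetical and are not taken from the report or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (an assumption of this sketch).
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/articles/1", "/private/report"):
        url = "https://example.com" + path
        print(path, may_crawl(ROBOTS_TXT, "ExampleBot", url))
```

Nothing in this check is enforced by the site itself; the crawler has to choose to run it and to honor the answer.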

Let me next talk about a chain of claims that’s important to understand this. General-purpose AI models and their very impressive capabilities rely on the scale of data and compute that have been used to train them. Scale and data really matter, and there are very few sources that provide public scale like the web does. So many of the foundation models were trained on [data sets composed of] crawls of the web. Under these popular and important data sets are essentially just websites and the crawling infrastructure used to collect and package and process that data. Our study looks at not just the data sets, but the preference signals from the underlying websites. It’s the supply chain of the data itself.

But in the last year, a lot of websites have started using robots.txt to restrict bots, especially websites that are monetized with advertising and paywalls—so think news and artists. They’re particularly fearful, and maybe rightly so, that generative AI might impinge on their livelihoods. So they’re taking measures to protect their data.

When a site puts up robots.txt restrictions, it’s like putting up a no trespassing sign, right? It’s not enforceable. You have to trust that the crawlers will respect it.

Longpre: The tragedy of this is that robots.txt is machine-readable but does not appear to be legally enforceable. Whereas the terms of service may be legally enforceable but are not machine-readable. In the terms of service, they can articulate in natural language what the preferences are for the use of the data. So they can say things like, “You can use this data, but not commercially.” But in a robots.txt, you have to individually specify crawlers and then say which parts of the website you allow or disallow for them. This puts an undue burden on websites to figure out, among thousands of different crawlers, which ones correspond to uses they would like and which ones they wouldn’t like.
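
The burden described here can be sketched in a few lines of Python. The example below generates a robots.txt that blocks a list of named crawlers and then checks it with the standard-library parser. The agent list is illustrative and incomplete, `NewAIBot` is a made-up crawler name, and the helper functions are hypothetical. The point is only that rules are keyed to crawler names, not to how the data will be used.

```python
from urllib.robotparser import RobotFileParser

# Illustrative, incomplete list of crawler names a site might choose to block.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot"]


def build_robots_txt(blocked_agents, blocked_path="/"):
    """Emit one User-agent group per named crawler, then allow everyone else."""
    groups = [
        f"User-agent: {agent}\nDisallow: {blocked_path}\n"
        for agent in blocked_agents
    ]
    groups.append("User-agent: *\nAllow: /\n")
    # Blank lines between groups separate the per-crawler records.
    return "\n".join(groups)


def allowed(robots_txt, agent, url="https://example.com/news/story"):
    """Return True if the given agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    robots_txt = build_robots_txt(AI_CRAWLERS)
    print(robots_txt)
    # A crawler that is not on the list passes, whatever it does with the data.
    for agent in ("ClaudeBot", "Googlebot", "NewAIBot"):
        print(agent, allowed(robots_txt, agent))
```

A crawler the site has never heard of falls through to the wildcard group, which is why the list has to be maintained by hand as new crawlers appear.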

Do we know if crawlers generally do respect the restrictions in robots.txt?

Longpre: Many of the major companies have documentation that explicitly says what their rules or procedures are. In the case, for example, of Anthropic, they do say that they respect the robots.txt for ClaudeBot. However, many of these companies have also been in the news lately because they’ve been accused of not respecting robots.txt and crawling websites anyway. It isn’t clear from the outside why there’s a discrepancy between what AI companies say they do and what they’re being accused of doing. But a lot of the pro-social groups that use crawling—smaller startups, academics, nonprofits, journalists—they tend to respect robots.txt. They’re not the intended target of these restrictions, but they get blocked by them.

In the report, you looked at three training data sets that are often used to train generative AI systems, which were all created from web crawls in years past. You found that from 2023 to 2024, there was a very significant rise in the number of crawled domains that had since been restricted. Can you talk about those findings?

Longpre: What we found is that if you look at a particular data set, let’s take C4, which is very popular, created in 2019—in less than a year, about 5 percent of its data has been revoked if you respect or adhere to the preferences of the underlying websites. Now 5 percent doesn’t sound like a ton, but it is when you realize that this portion of the data mainly corresponds to the highest quality, most well-maintained, and freshest data. When we looked at the top 2,000 websites in this C4 data set—these are the top 2,000 by size, and they’re mostly news, large academic sites, social media, and well-curated high-quality websites—25 percent of the data in that top 2,000 has since been revoked. What this means is that the distribution of training data for models that respect robots.txt is rapidly shifting away from high-quality news, academic websites, forums, and social media to more organization and personal websites as well as e-commerce and blogs.
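
The gap between the overall figure and the figure for the largest sites comes from weighting by size. The sketch below shows that arithmetic on made-up numbers. The domain names, token counts, and restriction flags are hypothetical and are not the study's data, and weighting by token count is an assumption of this sketch.

```python
from typing import NamedTuple


class Domain(NamedTuple):
    name: str
    tokens: int       # size of the domain's contribution to the data set
    restricted: bool  # True if its current robots.txt disallows the crawler


def restricted_share(domains):
    """Fraction of all tokens that come from restricted domains."""
    total = sum(d.tokens for d in domains)
    if total == 0:
        return 0.0
    blocked = sum(d.tokens for d in domains if d.restricted)
    return blocked / total


def top_by_size(domains, k):
    """The k largest domains by token count."""
    return sorted(domains, key=lambda d: d.tokens, reverse=True)[:k]


# Made-up example data, for illustration only.
DOMAINS = [
    Domain("bignews.example", 400, True),
    Domain("forum.example", 300, False),
    Domain("academic.example", 200, False),
    Domain("shop.example", 60, False),
    Domain("blog.example", 40, False),
]

if __name__ == "__main__":
    print(f"overall: {restricted_share(DOMAINS):.1%}")
    print(f"top 2:   {restricted_share(top_by_size(DOMAINS, 2)):.1%}")
```

When restrictions concentrate among the largest sites, the share within the head of the distribution is higher than the share across the whole data set.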

That seems like it could be a problem if we’re asking some future version of ChatGPT or Perplexity to answer complicated questions, and it’s taking the information from personal blogs and shopping sites.

Longpre: Exactly. It’s difficult to measure how this will affect models, but we suspect there will be a gap between the performance of models that respect robots.txt and the performance of models that have already secured this data and are willing to train on it anyway.

But the older data sets are still intact. Can AI companies just use the older data sets? What’s the downside of that?

Longpre: Well, continuous data freshness really matters. It also isn’t clear whether robots.txt can apply retroactively. Publishers would likely argue it does. So it depends on your appetite for lawsuits and on where you think the trends might go, especially in the U.S., with the ongoing lawsuits surrounding fair use of data. The prime example is obviously The New York Times against OpenAI and Microsoft, but there are now many variants. There’s a lot of uncertainty as to which way it will go.

The report is called “Consent in Crisis.” Why do you consider it a crisis?

Longpre: I think that it’s a crisis for data creators, because of the difficulty in expressing what they want with existing protocols. And also for some developers that are non-commercial and maybe not even related to AI—academics and researchers are finding that this data is becoming harder to access. And I think it’s also a crisis because it’s such a mess. The infrastructure was not designed to accommodate all of these different use cases at once. And it’s finally becoming a problem because of these huge industries colliding, with generative AI against news creators and others.

What can AI companies do if this continues, and more and more data is restricted? What would their moves be in order to keep training enormous models?

Longpre: The large companies will license it directly. It might not be a bad outcome for some of the large companies if a lot of this data is foreclosed or difficult to collect; it just creates a larger capital requirement for entry. I think big companies will invest more into the data collection pipeline and into gaining continuous access to valuable data sources that are user-generated, like YouTube and GitHub and Reddit. Acquiring exclusive access to those sites is probably an intelligent market play, but a problematic one from an antitrust perspective. I’m particularly concerned about the exclusive data acquisition relationships that might come out of this.

Do you think synthetic data can fill the gap?

Longpre: Big companies are already using synthetic data in large quantities. There are both fears and opportunities with synthetic data. On one hand, there have been a series of works that have demonstrated the potential for model collapse, which is the degradation of a model due to training on poor synthetic data that may appear more often on the web as more and more generative bots are let loose. However, I think it’s unlikely that large models will be hampered much because they have quality filters, so the poor quality or repetitive stuff can be siphoned out. And the opportunities of synthetic data are when it’s created in a lab environment to be very high quality, and it’s targeting particular domains that are underdeveloped.

Do you give credence to the idea that we may be at peak data? Or do you feel like that’s an overblown concern?

Longpre: There is a lot of untapped data out there. But interestingly, a lot of it is hidden behind PDFs, so you need to do OCR [optical character recognition]. A lot of data is locked away in governments, in proprietary channels, in unstructured formats, or in difficult-to-extract formats like PDFs. I think there’ll be a lot more investment in figuring out how to extract that data. I do think that in terms of easily available data, many companies are starting to hit walls and are turning to synthetic data.

What’s the trend line here? Do you expect to see more websites putting up robots.txt restrictions in the coming years?

Longpre: We expect the restrictions to rise, both in robots.txt and in terms of service. Those trend lines are very clear from our work, but they could be affected by external factors such as legislation, companies themselves changing their policies, the outcome of lawsuits, as well as community pressure from writers’ guilds and things like that. And I expect that the increased commoditization of data is going to cause more of a battlefield in this space.

What would you like to see happen in terms of standardization within the industry to make it easier for websites to express preferences about crawling?

Longpre: At the Data Provenance Initiative, we definitely hope that new standards will emerge and be adopted to allow creators to express their preferences in a more granular way around the uses of their data. That would make the burden much easier on them. I think that’s a no-brainer and a win-win. But it’s not clear whose job it is to create or enforce these standards. It would be amazing if the [AI] companies themselves could come to this conclusion and do it. But the designer of the standard will almost inevitably have some bias towards their own use, especially if it’s a corporate entity.

It’s also the case that preferences shouldn’t be respected in all cases. For instance, I don’t think that academics or journalists doing prosocial research should necessarily be foreclosed from using machines to access data that is already public, on websites that anyone could go visit themselves. Not all data is created equal and not all uses are created equal.
