The Internet’s Library is Under Threat

News orgs block the Internet Archive's crawlers—and undermine the truth

Apr 28, 2026

My 10-Day Online Safety Tuneup will be distributed on April 30! Get the goods—recently described as “life-saving” and “truly invaluable”—by becoming a paid subscriber today!

The first time I learned about the Internet Archive and its Wayback Machine—a library of over one trillion snapshots of webpages since the earliest days of the net—was in my high school AP US Government class. It was the early 2000s, and Mr. Fenster, our teacher, was showing us how to access a slice of history. Though I’ve long forgotten the exact website we were looking at—an early version of nytimes.com, maybe?—I remember being bemused at how the internet used to look.

Today, I rely on the Internet Archive for my research. I have a Wayback Machine sticker on my laptop. The Archive has helped me understand the impact malicious lies might have on my safety and allowed me to track changing narratives from both state-backed actors and independent grifters. It has tried to equalize access to information.

But today, the Internet Archive’s access to 23 major news sites, and by extension the world’s, is under threat. It’s not because those outlets are worried about paywall hopping. Their decision is motivated by something much dumber: the AI industry’s insatiable appetite for information.

Buckle up, friends—generative AI isn’t just hallucinating “facts” and creating deep fakes anymore. Now, it’s directly undermining the historical record.

I was lucky enough to visit the home of the Internet Archive in San Francisco last year. It's an old Christian Science church. Servers, along with funny papier-mâché statues of current and former employees, are in the sanctuary.

What is the Internet Archive?

The Internet Archive is best known for the Wayback Machine, which allows anyone to take a snapshot of what a webpage looked like at a given time, preserving that moment for the future. If a news outlet issues a correction or a politician makes a surreptitious edit to their website, the Archive provides a definitive record of a website at a specific moment, so long as someone—or an automated crawler—archives it.

The Archive’s founder, Brewster Kahle, says his goal in founding the Archive in 1996 was “to be a record of what happened so that people can’t rewrite history.” That goal has become a pressing need under the second Trump Administration, which has worked to suppress inconvenient facts and censor information it doesn’t like.

Less well known are the Archive’s libraries of 2 million books and 3 million hours of television, including an archive of state-run propaganda from the likes of Russia...and corporate propaganda from the likes of Fox News. The latter helped me understand the extent to which Fox had lied about me in 2022-2023. Since I don’t watch much TV, let alone keep Fox on all day and night, I needed a way to understand how often Fox had talked about me. The Archive provided it. By my accounting, I was among the top 20 Democrats that Fox discussed, even months after I left government service.

Don't call it a comeback

Nina Jankowicz and The American Sunlight Project

April 28, 2025

That’s a personal example, but the professional list is long and far-reaching. I’ve used the Archive to access content that has been taken down. Kat Tenbarge writes, “I’ve used the Wayback Machine to visit archived webpages from more than a decade ago, some of which helped corroborate serious allegations. I use the Wayback Machine regularly to figure out how many followers someone had at a given time, to resurface since-deleted material, and to find context for historical posts.”

In my research, I regularly archive public interest links myself in order to preserve access to them and create a historical record. I’ve been burned too many times by dead links in other people’s work, and that’s not their fault; Kahle points out that the life of the average webpage is only 100 days.

Why Are News Sites Blocking the Wayback Machine’s Crawlers?

According to reporting from WIRED, it’s not the Internet Archive itself that has led 23 news websites to block the Archive’s crawlers. It’s AI:

USA Today Co. spokesperson Lark-Marie Anton emphasized that “this effort is not about specifically blocking the Internet Archive” but instead part of the company’s broader efforts to block all scraping bots. Robert Hahn, the Guardian’s director of business affairs and licensing, says that it has been in conversation with the Archive over “concerns over potential misuse by AI companies of content sets crawled for preservation purposes.”

In plain English, there are two worries at play here:

AI models need new data to improve, and they’re using web crawlers not dissimilar to the Archive’s to hoover up that data. The Archive’s web crawler is collateral damage.
News organizations (rightfully) want to be compensated if their work is used to train AI models. Instead of (or sometimes, in addition to) concluding agreements with those organizations, AI companies may be scraping the Archive’s data. News organizations are (again, rightfully) angry.

While the Internet Archive recently told WIRED it is “in conversation” with the New York Times and other outlets about a resolution, those outlets have been issuing mealymouthed comments to the press that appear to be the PR equivalent of “it’s not you, it’s me.” The Archive isn’t the problem, they’ve said, it’s those bad AI companies! Now, I’m not the most technical person, but I imagine it would be relatively easy for news organizations to whitelist the Archive’s crawlers while blocking the malicious ones from scraping their sites. Instead, they’re choosing the nuclear option at a time when the Archive’s work couldn’t be more important. These organizations are also shooting themselves in the foot. Even as USA Today Co. announced it would be blocking Archive crawlers, its journalists relied on the service for a recent investigation into ICE immigration data.
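
As a rough illustration of that whitelisting idea, here is a minimal Python sketch that builds a robots.txt file letting archive crawlers in while turning AI crawlers away, then checks the result with Python's standard-library parser. The user-agent names are assumptions based on names these crawlers have publicly used, and news.example is a made-up site. Keep in mind that robots.txt is a voluntary convention: it only stops crawlers that choose to honor it, so this is a sketch of the polite case, not a complete defense.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens; a real site should confirm them against
# each crawler operator's current documentation.
ARCHIVE_AGENTS = ["archive.org_bot", "ia_archiver"]
AI_AGENTS = ["GPTBot", "CCBot"]


def build_robots_txt(allow_agents, block_agents):
    """Return robots.txt text that allows some crawlers and blocks others."""
    lines = []
    for agent in allow_agents:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    for agent in block_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    return "\n".join(lines)


def can_crawl(robots_txt, agent, url):
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(ARCHIVE_AGENTS, AI_AGENTS)
story = "https://news.example/story"  # hypothetical URL
print(can_crawl(robots_txt, "archive.org_bot", story))
print(can_crawl(robots_txt, "GPTBot", story))
```

Under these rules the archive crawler is permitted and the AI crawler is refused. A crawler that ignores robots.txt would need to be blocked some other way, such as at the server or network level.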

As journalists organizing to support the Archive wrote:

In previous generations, journalists would turn to the physical archives of a local newspaper or of a local public library to access historical reporting and follow the threads of the present back into history. With many newspapers closed, and no clear path for local public libraries to preserve digital-only reporting, the work of safeguarding journalism’s record increasingly falls to the Internet Archive.

What You Can Do

I assume you are as incensed about these attacks on the Internet Archive as you would be if you learned your local library couldn’t acquire books anymore. That’s essentially what’s happening here—along with the undermining of current and future journalism, accountability, the historical record, and the truth.

But you can push back:

Archive stuff. I’m going to require that my graduate students archive sources for their papers, and as a rule, I do my best to archive sources I use in my work or shady stuff I see online. You can do the same. Simply add the URL to the box on the right-hand side of web.archive.org.
Donate. The Internet Archive is a non-profit. They’re being attacked by extremely powerful forces. Need I say more?
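
If you archive a lot of links, the first step can be scripted. The sketch below only builds URLs and does not contact the Archive. It assumes the Wayback Machine's "Save Page Now" address pattern (web.archive.org/save/ followed by the page address) and its public availability lookup at archive.org/wayback/available. The function names and the example page are made up for illustration.

```python
from urllib.parse import urlencode


def save_page_now_url(page_url):
    """Build the address that asks the Wayback Machine to archive a page."""
    return "https://web.archive.org/save/" + page_url


def availability_lookup_url(page_url, timestamp=None):
    """Build a lookup URL for the archived snapshot closest to `timestamp`.

    `timestamp` is an optional string such as "20060101" (YYYYMMDD).
    """
    params = {"url": page_url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


page = "https://example.com/story"  # hypothetical page to preserve
print(save_page_now_url(page))
print(availability_lookup_url(page, "20060101"))
```

Opening the first printed address in a browser should trigger a capture. The second returns a small JSON description of the closest existing snapshot, if there is one.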

What’s your earliest memory of the Wayback Machine? What do you use the Internet Archive for? Let’s make the comments a love fest for an OG internet gem. 🧭
