[RERUN] Google Doesn't Forgive -- But What Does It Forget?

by Gattsuru

A few years ago, Tim Bray reported that he had noticed an unusual gap in a simple search:

I think Google has stopped indexing the older parts of the Web. I think I can prove it.

Marco Fioretti approaches the analysis from another perspective:

Back in 2006, I published on one of my domains, digifreedom.net, the opinion piece “Seven Things we’re tired of hearing from software hackers”. A few years later, for reasons not relevant here, I froze that whole project. One unwanted consequence was that the “Seven Things”, together with other posts, were not accessible anymore. I was able to put the post back online only in December 2013, at a new URL on this other website. Last Saturday I needed to email that link to a friend and I had exactly the same experience as Bray: Google would only return links to mentions, or even to whole copies, but archived elsewhere...

Unlike Bray’s, my own post disappeared from the Web for a while, and then reappeared with the original date, but only after a few years, and in a different domain. This is an important difference which may mean that, in my case, part of Google’s failure is my own fault. Still, for all practical purposes, the result is the same:

DuckDuckGo gives as first result the most, if not the only correct answer to whoever would be interested in that post today: the current link to the original version, on the (current) website of its author. DuckDuckGo gets things right. Google does not (not at the time of writing, of course).

As with googlewhacks, publicizing any specific example of this problem widely enough results in it being fixed, and both Bray's and Fioretti's vanished posts have shown back up since they first revealed them. When ErosBlogBachus noticed the problem they posted an example, and it took somewhat under a month for "Dildoes in the Subway" to show back up on targeted searches. It's pretty trivial to look through any sufficiently old blog or forum and select a post at random to verify that it's not just these topics or writers, or to find your own example cases.
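For anyone who wants to run that spot check a little more systematically, here is a minimal sketch. It picks one post at random from a list you supply and builds exact-title search links for Google and DuckDuckGo to open by hand. The function name `spot_check_urls` and the sample posts are made up for illustration, and the sketch assumes only the ordinary public `q=` query parameter on each search page. It does not scrape or score results; comparing them is still up to you.

```python
import random
from urllib.parse import urlencode


def spot_check_urls(posts, rng=None):
    """Pick one (title, url) pair at random and build exact-title search links."""
    if rng is None:
        rng = random.Random()
    title, original_url = rng.choice(posts)
    # Wrapping the title in double quotes asks the engine for an exact phrase match.
    query = urlencode({"q": f'"{title}"'})
    return {
        "title": title,
        "original_url": original_url,
        "google": f"https://www.google.com/search?{query}",
        "duckduckgo": f"https://duckduckgo.com/?{query}",
    }


# Hypothetical sample data: replace with titles and URLs from an old blog or forum.
old_posts = [
    ("An example post from 2006", "https://example.org/2006/example-post"),
    ("Another old forum thread", "https://example.org/forum/thread-42"),
]

check = spot_check_urls(old_posts)
print(check["original_url"])
print(check["google"])
print(check["duckduckgo"])
```

Open both links and see whether the original URL shows up on either engine, or only mentions and copies hosted elsewhere.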

At a trivial level, this reveals a serious gap: Bray compares the failure to dementia, and it's not a wrong metaphor. The internet never really had Fioretti's utopia of a "permanent, long-lived store of humanity’s intellectual heritage" -- the collapses of UseNet, Geocities, and a thousand small webhosts (and recently, even archives of UseNet posts!) are just examples of a problem that dates back before Eternal September, and the Triumph of the Deletionists is a more common one. Even going back just two years, you'll find a surprising number of gaps, deletions, and 404s. But automating the memory-holing is somewhat novel. For now, it's just Google... but Google has an odd way of making things standard, even beyond the typical 'pay to have their search engine be the default'.

((And unusually, might not be intentional-qua-impact. I expect this points to one of the secret sauces of Google's current search indexing.))

The overt implications of a world where the majority of discussions from the mid-1990s to 2009 only show up after they've been recently and publicly linked are trivial in a boring bullying sorta way, but the deeper ramifications are less pleasant. It's also not hard to see this as obfuscating the origins and discussions of communities that keep their day-to-day conversations behind robots.txt or in private spaces.

h/t to ErosBlogBachus for bringing me to these topics, though I'll caution that their writing is primarily NSFW. See also discussion from Gwern regarding possible structures and further ramifications here.

