What’s in a number?

Tuesday, September 6, 2005
Posted by Richard Zwicky @ 2:22 pm

As everyone with half a brain online already knows, the search engines are almost falling over each other in the race to provide the best results to their customers. While I was in San Jose at the Search Engine Strategies conference in August, Yahoo! announced that their index had just reached 20 billion documents.

Ever since then, there has been a ton of speculation around their claim to have the biggest index. A Google engineer I had the chance to listen to while at Google’s offices commented that Google is trying to remove all duplicate content from their database, which means that as it expands, it’s also being pared down. His suggestion: perhaps Yahoo! was counting with the duplication filters turned off. (?)

If Yahoo! were counting all documents, without filtering for duplication, then they could have a much larger index than others. But is it any good? I wrote an article called “The Google Dance” a few years ago, and then updated it occasionally until early last year. It’s been reprinted a few hundred times that I know of, and translated into many languages and reprinted from there. I remember that a few days after it was first published, I saw it in Russian, Japanese, Korean, French, Spanish and German. I was blown away (and still am). It also got reprinted many, many times in English. From what I understood, Google is trying to remove all those duplicate copies, and just leave the original or source article available. Makes sense – if someone is searching for information on “the Google Dance” they don’t want to wade through many copies of the same document.

So, is Yahoo!’s number really just the total number of docs they review in their index, but maybe not all the ones they list? Perhaps. For fun, here’s a comparison of the total number of results for the term “Google Dance” in MSN, Yahoo! and Google.

Google: 4,200,000
MSN: 3,369,811
Yahoo: 14,000,000

If Google is searching through 8,168,684,336 web pages, and we assume that the Google Dance is an average representation for content reprints, then Yahoo!’s index would be about 3.3x the size of Google’s (14,000,000 results against 4,200,000). That would give them an index of roughly 27 billion. Still a huge number.
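For anyone who wants to redo the back-of-the-envelope math, here is a minimal sketch in Python. It uses only the result counts and the Google page count quoted above. The variable names are mine, and the whole thing rests on the same shaky assumption as the paragraph: that the ratio of results for one phrase mirrors the ratio of total index sizes.

```python
# Result counts for the term "Google Dance", as listed in the post.
google_hits = 4_200_000
yahoo_hits = 14_000_000

# Number of web pages Google is searching, as quoted in the post.
google_index = 8_168_684_336

# Assumption (from the post): the ratio of results for this one phrase
# is representative of the ratio of the two indexes overall.
ratio = yahoo_hits / google_hits
estimated_yahoo_index = ratio * google_index

print(f"Yahoo/Google result ratio: {ratio:.2f}x")
print(f"Estimated Yahoo index: {estimated_yahoo_index / 1e9:.1f} billion documents")
```

Swap in counts for a different phrase and the estimate will move around quite a bit, which is a good reminder of how rough this kind of extrapolation is.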

Does Yahoo! really have 20 billion documents in their index? Possibly. But if they are counting every copy of the Google Dance article as an individual document, then these are not 20 billion meaningful documents.

Does it matter?

Nope. What matters is which search engine has the biggest index of documents people care about.

Having the biggest overall index isn’t worth much if people can’t find what they are looking for.

Having the right documents: that’s really all the search engines should care about.
