SEO Blog

Tuesday, May 27, 2008

Google Diving Deeper - Googlebot Fills Out Forms

Posted by Jim Hedger @ 12:04 pm

Google’s page-finding spider, Googlebot, has developed a new talent over the past six months that draws it deeper into areas of the web that were previously inaccessible to search engine spiders. Googlebot can now fill in and submit the simplest type of web-forms and spider through the content hidden behind them. When Googlebot finds useful or interesting information behind web-forms, it adds that content to its index.

Reports of Googlebot working its way through forms started surfacing in November 2007, though the feasibility of the form-filling indexing technique was written up in Bill Slawski’s SEO by the Sea blog in October 2006. Google confirmed the behaviour in a post to the Official Google Webmaster Central Blog, “Crawling through HTML forms”, in April 2008.

Google’s venture into the Invisible Web is extremely interesting, though in some cases possibly perilous. An enormous amount of data is (or was) hidden from search spiders and therefore “invisible” to the general public. While nobody really knows the depth of the Invisible Web, it is estimated to be several times the size of the Visible Web. “Known (or Charted) Cyberspace” is about to get very much larger.

According to a teachers’ guide published by UC Berkeley, “Invisible or Deep Web: What it is, Why it exists, How to find it, and Its inherent ambiguity”:

Remember that the Invisible Web exists. In addition to what you find in search engine results (including Google Scholar) and most web directories, there are other gold mines you have to search directly. This includes all of the licensed article, magazine, reference, news archives, and other research resources that libraries and some industries buy for those authorized to use them. The contents of these are not freely available: libraries and corporations buy the rights for their authorized users to view the contents. If they appear free, it’s because you are somehow authorized to search and read the contents (library card holder, member of the company, etc.).

What kind of content is Google going after? Google’s core mission is to “… organize the world’s information and make it universally accessible and useful.” It’s hard to say exactly what Google is looking to index, but a reasonable guess would be: all it can get. Google stresses that it only targets simple web forms that use a GET request and avoids more complicated, data-saving forms, which tend to use POST requests.

Simple web forms are used to gather general, non-identifying information about website visitors, often in order to provide a more personalized experience. For instance, a drop-down menu might ask visitors which state or province they reside in, or a text box beside a SUBMIT button might ask for their zip or postal code, in order to present regionally relevant information. Webmaster best practices say simple web forms should be programmed as GET requests, as opposed to the data-saving POST request, which often triggers the retention of user-entered information in a database or creates new on-the-fly page content based on information entered by the user.
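To see why GET forms are within a spider's reach, it helps to remember that submitting one simply produces an ordinary URL with the field values in the query string. The short Python sketch below illustrates that idea only; it is not Googlebot's actual method. The site address, the form action "find", the field name "region" and the option values are all hypothetical.

```python
from urllib.parse import urlencode, urljoin


def get_form_urls(page_url, action, field_name, options):
    """Build the URL that each drop-down choice would produce on a GET submit."""
    # Resolve the form's action against the page it appears on.
    base = urljoin(page_url, action)
    # A GET submit appends the field name and chosen value as a query string.
    return [f"{base}?{urlencode({field_name: value})}" for value in options]


# Hypothetical form: <form method="get" action="find"> with a "region" drop-down.
urls = get_form_urls(
    "http://www.example.com/stores/",
    "find",
    "region",
    ["British Columbia", "Ontario"],
)

for url in urls:
    print(url)
```

Each printed address is a plain link that any spider could request, which is why a drop-down with a handful of options is an easy target. A POST form sends the same values in the request body instead, so no such link exists.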

According to Google search engineer Matt Cutts, “… it’s less about crawling search results and more about discovering new links. A form can provide a way to discover different parts of a site to crawl. The team that worked on this did a really good job of finding new urls while still being polite to webservers.”

Google’s Webmaster Central Blog states that its spider respects robots.txt, NOINDEX and NOFOLLOW directives. There are still a lot of bugs to work out. Recent posts to search marketing blogs and forums tell of Google creating long lists of duplicate results (with very minor differences such as addresses or regions) stemming from the repeated filling in of forms and indexing of the resulting content. Others post about Google making personal content available (though those results likely stem from a flaw in web design or the lack of a robots.txt file).

Website designers need to be careful to protect client and site-user data when creating forms, either by using the more complicated POST request or by applying spider directives such as a robots.txt file at the site root or NOINDEX and NOFOLLOW meta tags in their documents. SEOs need to be extra vigilant when working on client code to ensure that these directives are in place when necessary.
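One way to confirm that a directive is in place is to test the form's result URLs against the site's robots.txt rules. The sketch below uses the robots.txt parser in Python's standard library; the rules, paths and addresses are hypothetical, and the parser only approximates how a given spider reads the file, so treat it as a sanity check and not as proof of how Googlebot will behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all spiders out of the form's result pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /stores/find
"""


def allowed_for(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


form_result = "http://www.example.com/stores/find?region=Ontario"
form_page = "http://www.example.com/stores/"

print(allowed_for("Googlebot", form_result))  # result page behind the form
print(allowed_for("Googlebot", form_page))    # page that hosts the form
```

With these assumed rules the first check should come back False and the second True: the page carrying the form stays crawlable while the URLs generated by submitting it do not.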

At the same time, very clever SEOs are likely to find ways to benefit from Googlebot’s newest confirmed behaviour. Watch for the obvious signs, especially in industries such as finance, mortgage and tourism.

3 Comments »

  1. You have given wonderful information about Googlebot. I didn’t know about this tool. This will certainly make our job easier.

    Comment by keaton — Tuesday, May 27, 2008 @ 11:56 pm

  2. Ahh googlebot. We have to add ‘increasing average stupidity’ to your ever increasing pool of talents. You are really dumbing down gen Y-ers, thanks to your drivel you keep serving up in your searches.

    http://justtofun.blogspot.com/2008/05/fast-food-information-ala-google.html

    Comment by Ninja — Wednesday, June 4, 2008 @ 6:34 pm

  3. Cheers - have been looking for some info on this to send to my technical department. Can see websites using GET requests rather than POST requests on their forms to get their pagebloat up. Do the form submissions count as links - if so some sites could find their internal PageRank collapsing as their forms now make a couple of hundred links appear - ruining any PR sculpting they’ve done. Am going to have to look into this a bit more - before I start trying to work out what to do about it.

    Comment by Mike — Monday, June 30, 2008 @ 12:35 am
