I logged into my Google Webmaster Tools account yesterday to find a strange message waiting for me.

A message from Google Webmaster Tools

Initially I didn’t think too much of it. This year our site has gone from having around 30,000 pages listed in Google’s index to having over 1.5 million. That’s a lot of URLs in anyone’s language. However, Google did helpfully supply a list of URLs that it said were problematic, so I decided to take a closer look.

It turned out that we had somehow opened a back door to search pages on the website with no results on them. In other words, we were internally linking to potentially millions of pages that had little or no content.

I’ll hand you over to Dave to explain what happened and why:

Users can navigate RevaHealth.com in a variety of ways, including searching using text input and by narrowing their current search criteria by location, type of clinic, treatment, and so on.

The narrowing mechanism also allows Google to traverse the site. Each entry in our dropdown lists is a link to another search page.

For example, on the Dublin Dentists page there are links to locations within Dublin (/dentists/Ireland/county-dublin/portmarnock) and links to search pages for specific treatments (/dentists/Ireland/county-dublin/fillings). Google crawls the website by following these links.

We don’t want Google (or users) to be able to select options or follow links from these dropdowns that will bring them to search pages with no results. Therefore, when the current search page is created, we determine which options would be valid. If we do not have any dentists in Portmarnock on the site, then we simply don’t include the link to /dentists/Ireland/county-dublin/portmarnock in the dropdown. Similarly, if there are no dentists in Dublin that perform fillings, we don’t include links to that set of results either. This makes the search page more complex to build, but it’s vital. Otherwise we would invite Google to index millions of empty pages (and make our narrowing far less useful).
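The filtering described above can be sketched roughly as follows. This is only an illustration, not our production code: the Clinic class, the narrowing_links function and the sample town "swords" are all hypothetical names made up for the example, and a real site would ask the database for counts rather than loop over clinics in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clinic:
    # Hypothetical record; the real schema is not shown in this post.
    name: str
    town: str
    treatments: frozenset


def narrowing_links(base_path, clinics, towns, treatments):
    """Return only the dropdown links that lead to at least one result."""
    links = []
    for town in towns:
        # Skip locations with no clinics, so no empty page is ever linked.
        if any(c.town == town for c in clinics):
            links.append(f"{base_path}/{town}")
    for treatment in treatments:
        # Skip treatments that no clinic in the current result set offers.
        if any(treatment in c.treatments for c in clinics):
            links.append(f"{base_path}/{treatment}")
    return links


clinics = [Clinic("Example Dental", "swords", frozenset({"fillings"}))]
links = narrowing_links(
    "/dentists/Ireland/county-dublin",
    clinics,
    towns=["portmarnock", "swords"],
    treatments=["fillings", "implants"],
)
print(links)
```

With no clinic in Portmarnock in the sample data, the Portmarnock link is left out of the dropdown, so neither users nor Google are offered a path to an empty page.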

After seeing Google’s message in Webmaster Tools we realised that something we had done had allowed many empty search pages to be reached by Googlebot. Disaster! The search pages themselves seemed fine. However, we had recently added our narrowing functionality to our clinics’ brochure pages as well. While this should have behaved identically to the search pages, we found that it didn’t. The navigation on brochure pages actually allowed some empty pages to be reached.

Thankfully, we had known from the start that empty pages might be reached as clinics came and went from the database, and had added a “noindex” tag to the empty pages should they be found. So, even though Google was reaching them, they were not being added to the index. Unfortunately, from each of those pages we also linked to a map page which shows the clinics on a map. This wasn’t tagged with a “noindex”, so when Google reached an empty search results page it ignored that page, but then followed the link and indexed the empty map page. As Googlebot did its busy work, thousands and thousands of these map pages were added until Google decided to warn us about them.
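To make the gap concrete, here is a minimal sketch of how the “noindex” decision could live in one shared helper so that the list view and the map view cannot drift apart. The function names robots_meta and render_head are hypothetical and this is not our actual template code; the only real-world detail is the standard robots meta tag itself.

```python
from html import escape

# Standard robots meta tag asking search engines not to index the page.
NOINDEX_TAG = '<meta name="robots" content="noindex">'


def robots_meta(result_count):
    """Return the noindex tag for empty result sets, otherwise nothing."""
    return NOINDEX_TAG if result_count == 0 else ""


def render_head(title, result_count):
    """Build the <head> contents for any results view (list or map)."""
    parts = [f"<title>{escape(title)}</title>"]
    tag = robots_meta(result_count)
    if tag:
        parts.append(tag)
    return "\n".join(parts)


# Both views of the same empty result set go through the same helper.
list_head = render_head("Dentists in Portmarnock", 0)
map_head = render_head("Map of dentists in Portmarnock", 0)
print(list_head)
print(map_head)
```

Because every view calls the same function with the same result count, a newly added view such as the map inherits the tag automatically instead of relying on someone remembering to add it.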

Fortunately, the bug, while hard to find, was very easy to fix. So our hearty thanks to Google, the newest member of our test team.

Luckily we identified this back door quite quickly and have blocked it off, but if we hadn’t known about it, Google would probably have stopped indexing large portions of our site, which would have had a very serious effect on our traffic.

So, our rather obvious tip of the day: if you haven’t already signed up for Google’s Webmaster Tools, do it today here:

http://www.google.com/webmasters/tools/

Our other rather obvious tip is that when Google takes the time to tell you there’s a problem, there probably is a problem! Have you had any messages from Google lately, and have they helped you identify any potential problems?