Saturday, August 05, 2006

Domain names and Filenames

To a spider, the different forms of a home page address (with or without the "www." prefix, and with or without a filename such as index.html) are different urls and, therefore, different pages. Surfers arrive at the site's home page whichever url is used, but spiders see each one as an individual url, and that makes a difference when working out the PageRank. It is better to standardize the url you use for the site's home page. Otherwise each url can end up with its own PageRank, whereas all of it should have gone to just one url.
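As a rough illustration of what standardizing means, here is a minimal Python sketch that maps the home page variants onto one preferred url. The domain www.example.com, the function name canonical_url, and the list of index filenames are assumptions made for the example, not anything a spider or search engine is known to do; it also assumes the server returns the same page for all the variants.

```python
from urllib.parse import urlsplit, urlunsplit

# Filenames assumed to be served as the directory's default page.
INDEX_FILENAMES = {"index.html", "index.htm", "index.php", "default.html"}


def canonical_url(url, preferred_host="www.example.com"):
    """Map www/non-www and index-filename variants onto one preferred url."""
    parts = urlsplit(url)
    host = parts.netloc.lower()

    # Treat the bare domain and the "www." domain as the same site.
    bare = preferred_host[4:] if preferred_host.startswith("www.") else preferred_host
    if host in (bare, "www." + bare):
        host = preferred_host

    # An empty path and "/" are the same request; drop a trailing index filename.
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail.lower() in INDEX_FILENAMES:
        path = head + "/"

    return urlunsplit((parts.scheme, host, path, parts.query, ""))


variants = [
    "http://www.example.com/",
    "http://example.com/",
    "http://www.example.com/index.html",
    "http://example.com/index.html",
]
for v in variants:
    print(v, "->", canonical_url(v))
```

Running every internal link through something like this before publishing keeps the site pointing at a single home page url.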

If you think about it, how can a spider know the filename of the page that it gets back when requesting a url that ends at the domain name, with no filename? It can't. The filename could be index.html, index.htm, index.php, default.html, etc. The spider doesn't know. If you link to index.html within the site, the spider could compare the 2 pages, but that seems unlikely. So they are 2 urls and each receives PageRank from inbound links. Standardizing the home page's url ensures that the PageRank it is due isn't shared with ghost urls.

Example: Go to the UK Holidays and UK Holiday Accommodation site - how's that for a nice piece of link text ;). Notice that the url in the browser's address bar contains "www.". If you have the Google Toolbar installed, you will see that the page has PR5. Now remove the "www." part of the url and get the page again. This time it has PR1, and yet they are the same page. Actually, the PageRank is for the unseen frameset page.

When this article was first written, the non-www URL had PR4 due to using different versions of the link URLs within the site. It had the effect of sharing the page's PageRank between the 2 pages (the 2 versions) and, therefore, between the 2 sites. That's not the best way to do it. Since then, I've tidied up the internal linkages and got the non-www version down to PR1, so that the PageRank within the site mostly stays in the "www." version. There must still be a site somewhere that links to it without the "www.", and that is what's causing the PR1.
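The sharing effect can be seen in a toy calculation. The sketch below is a simplified textbook-style PageRank iteration, not Google's actual algorithm, and the toolbar's PR numbers are on a different scale. The four linking pages, the example.com urls, and the damping value of 0.85 are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outbound links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            for t in targets:
                new[t] += damping * rank[src] / len(targets)
        rank = new
    return rank


WWW = "http://www.example.com/"
BARE = "http://example.com/"

# Case 1: all four inbound links use the same home page url.
single = {"a": [WWW], "b": [WWW], "c": [WWW], "d": [WWW]}

# Case 2: the inbound links are divided between the two versions.
split = {"a": [WWW], "b": [WWW], "c": [BARE], "d": [BARE]}

rank_single = pagerank(single)
rank_split = pagerank(split)

print("one url:  ", rank_single[WWW])
print("two urls: ", rank_split[WWW], rank_split[BARE])
```

In the second case each version of the home page ends up with a smaller score than the single standardized url gets in the first case, which is the sharing described above.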

Imagine the page index.html on the "www." version of the domain. The index page contains links to several relative urls; e.g. products.html and details.html. The spider sees those urls as belonging to the "www." domain. Now let's add an absolute url for another page, only this time we'll leave out the "www." part. This page links back to the index.html page, so the spider sees that index page as being on the non-www domain. Although it's the same index page as the first one, to a spider, it is a different page because it's on a different domain. Now look what happens. Each of the relative urls on the index page is also different because it now belongs to the non-www domain. Consequently, the link structure is wasting the site's potential PageRank by spreading it between ghost pages.
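The way relative urls pick up the domain of the page they sit on can be shown with Python's standard urljoin, which resolves a relative link against a base url much as a browser or spider would. The example.com addresses and the helper name resolve_links are hypothetical.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Resolve each link found on a page against that page's own url."""
    return [urljoin(base_url, href) for href in hrefs]


relative_links = ["products.html", "details.html"]

# The same index page, reached through the two versions of the domain.
www_index = "http://www.example.com/index.html"
bare_index = "http://example.com/index.html"

print(resolve_links(www_index, relative_links))
print(resolve_links(bare_index, relative_links))
```

The two printed lists contain no url in common, so a single stray non-www link is enough to create a second, ghost copy of every page reached through relative links.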
