Friday, May 04, 2007

How To Ask Alexa Crawler To Crawl Your Site.

This is an important checklist to make sure your site will be crawled by Alexa. The Alexa crawler (robot), which identifies itself as ia_archiver in the HTTP “User-agent” header field, uses a web-wide crawl strategy.

Basically, it starts with a list of known URLs from across the entire Internet, then fetches the local links it finds as it goes. There are several advantages to this approach, most importantly that it creates the least possible disruption to the sites being crawled. Alexa will not index anything you would like to remain private. All you have to do is tell them.

By using the Standard for Robot Exclusion (SRE).
The SRE was developed by Martijn Koster at Webcrawler to allow content providers to control how robots behave on their sites. All of the major Web-crawling groups, such as AltaVista, Inktomi, and Google, respect this standard. Alexa Internet strictly adheres to the standard:
The Alexa crawler looks for a file called “robots.txt”.
Robots.txt is a file website administrators can place at the top level of a site to direct the behavior of web crawling robots.

The Alexa crawler will always pick up a copy of the robots.txt file prior to its crawl of the Web. If you change your robots.txt file while we are crawling your site, please let us know so that we can instruct the crawler to retrieve the updated instructions contained in the robots.txt file.

So please use these tips:

To exclude all robots, the robots.txt file should look like this:
User-agent: *
Disallow: /

To exclude just one directory (and its subdirectories), say, the /images/ directory, the file should look like this:
User-agent: *
Disallow: /images/

Web site administrators can allow or disallow specific robots from visiting part or all of their site. Alexa’s crawler identifies itself as ia_archiver, and so to allow ia_archiver to visit (while preventing all others), your robots.txt file should look like this:
User-agent: ia_archiver
Disallow:
To prevent ia_archiver from visiting (while allowing all others), your robots.txt file should look like this:
User-agent: ia_archiver
Disallow: /
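You can verify how crawlers will interpret rules like these before deploying them. Below is a minimal sketch using Python's standard urllib.robotparser module; the rules string mirrors the "allow ia_archiver, block everyone else" example above, and the path /images/photo.jpg is just an illustrative URL:

```python
# Sketch: check robots.txt rules locally with Python's standard library.
from urllib.robotparser import RobotFileParser

# The "allow ia_archiver, block all other robots" example from above.
rules = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# An empty Disallow line means "everything is allowed" for that agent.
print(parser.can_fetch("ia_archiver", "/images/photo.jpg"))  # True
print(parser.can_fetch("Googlebot", "/images/photo.jpg"))    # False
```

The same parser can also fetch a live file: set the URL with parser.set_url("https://example.com/robots.txt") and call parser.read() instead of parse().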

That’s all.


Keith said...

ia_archiver has 'accidentally' crawled some private (password protected!!!) admin pages on one of my sites - pages that cause the deletion of stock!

It looks to me like ia_archiver not only crawls pages that it finds links to, but also pages it discovers through the toolbar.

Don't rely on robots not finding files you don't want them to see - block them with the robots.txt file as a precaution!

Anonymous said...

That's all very well, but HOW do you get it to recrawl your site?
My site is

