What is a web crawler and how does it work?

Have you ever wondered how answers can be at our fingertips in the digital age? It seems impossibly convenient to be able to type a question into a search bar and receive a list of helpful resources.

Behind that convenience is a massive library of information, or index, which is amassed using web crawlers.

When a user submits a query, search engine algorithms sort through the data in this index to return the most relevant results.

Crawlers apparently gained the name because they crawl through a site a page at a time, following the links to other pages on the site until all pages have been read.

They exist to discover, understand, and organize the internet's content in order to offer the most relevant results to the questions searchers are asking.

How do web crawlers work?

A web crawler (also called a spider or bot) starts with a list of URLs to visit, called the seeds.

Once it lands on a page, it follows all the links on that page and then does the same to each page it can find.

The content from these pages is fetched and analysed, and a link extractor then parses the HTML and extracts all the links. All of this information is stored prior to indexing.
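
To make the link extraction step a little more concrete, here is a minimal sketch using only Python's standard library. The class name LinkExtractor, the helper extract_links and the example.com URLs are made up for illustration; real crawlers handle many more cases (other tags, malformed markup, duplicate links).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    """Hypothetical helper: return all links found in one HTML document."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<p><a href="/about">About</a> <a href="https://example.org/x#top">X</a></p>'
print(extract_links("https://example.com/index.html", page))
# ['https://example.com/about', 'https://example.org/x']
```

Relative links are resolved against the URL of the page they were found on, so the crawler always ends up with full addresses it can visit later.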

The pages are then ranked by looking at the quality of the content and other factors. This covers image and voice search as well as text.

Search engines process and store the information they find in an index, a huge database of all the content they've discovered and deem good enough to serve up to searchers.

Relevance is determined by algorithms specific to each crawler, but these typically weigh factors like the accuracy, frequency, and location of keywords.
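
As a toy illustration of an index and of keyword frequency and location, the sketch below builds a tiny inverted index and ranks pages for a single keyword. The scoring rule (more occurrences first, then earliest occurrence) and the function names build_index and search are assumptions made for this example; it is not how any real search engine ranks pages.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to {url: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].setdefault(url, []).append(position)
    return index


def search(index, keyword):
    """Rank pages for one keyword: more occurrences first, then earlier first occurrence."""
    postings = index.get(keyword.lower(), {})
    return sorted(postings, key=lambda url: (-len(postings[url]), postings[url][0]))


# Made-up page contents standing in for crawled text.
pages = {
    "/a": "Crawlers crawl the web. Crawlers follow links.",
    "/b": "A guide to links and crawlers.",
    "/c": "Nothing relevant here.",
}
index = build_index(pages)
print(search(index, "crawlers"))  # ['/a', '/b']
```

The point of the index is that a query only has to look up the keyword, rather than re-reading every page each time someone searches.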

Whenever the crawler finds new links on a site, it adds them to the list of pages to visit next.

If it finds changes in the links or broken links, it will make a note of that so the index can be updated.
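
The visiting process described above can be sketched as a simple breadth-first loop over a queue of URLs. This is a simplified sketch under stated assumptions: fetch_links is a hypothetical function you would supply (it returns the links on a page, or raises LookupError when the page cannot be fetched), and FAKE_SITE is an invented three-page site used in place of real HTTP requests.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    `fetch_links(url)` returns the links found on `url`, or raises
    LookupError if the page cannot be fetched (a broken link).
    """
    frontier = deque(seeds)      # URLs waiting to be visited
    seen = set(seeds)            # URLs already queued, so none is visited twice
    visited, broken = [], []

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        try:
            links = fetch_links(url)
        except LookupError:
            broken.append(url)   # note it so the index can be updated
            continue
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited, broken


# A made-up site: each page maps to the links it contains.
FAKE_SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/missing"],
    "/blog": ["/about"],
}


def fake_fetch(url):
    return FAKE_SITE[url]  # KeyError is a subclass of LookupError


visited, broken = crawl(["/"], fake_fetch)
print(visited)  # ['/', '/about', '/blog']
print(broken)   # ['/missing']
```

The set of seen URLs is what stops the crawler going round in circles when pages link back to each other.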

Examples of web crawlers

These examples of web crawlers primarily fall into two categories:

Commercial web crawlers

Googlebot is the generic name for Google's web crawler which includes a desktop crawler and a mobile crawler.
Bingbot is the name of Microsoft's Bing web crawler. It replaced Msnbot.
Baiduspider is Baidu's web crawler.

Open-source web crawlers

GNU Wget is a command-line-operated crawler typically used to mirror Web and FTP sites.
Heritrix is the Internet Archive's crawler.
Apache Nutch is a highly extensible and scalable web crawler.

How often do pages get crawled?

Search engines set policies on which content to crawl, in what order, and how frequently.

Google, for example, indexes billions of web pages, and it would be impossible to crawl every page every day.

URLs are crawled at different rates depending on how important the search engines deem the sites to be and how often content is updated on those sites.

So some URLs are crawled daily, some weekly, and others only every couple of months or even once every half a year.
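
One way to picture different crawl rates is a priority queue ordered by when each URL is next due. The sketch below is only an illustration: the function crawl_order, the URLs and the intervals (1, 7 and 180 days) are invented, and real search engines do not publish their schedules.

```python
import heapq


def crawl_order(intervals_in_days, horizon_days):
    """Return (day, url) visits up to `horizon_days`, revisiting each URL
    every `intervals_in_days[url]` days."""
    # Heap of (next_due_day, url); every URL is due on day 0.
    queue = [(0, url) for url in intervals_in_days]
    heapq.heapify(queue)
    visits = []
    while queue and queue[0][0] <= horizon_days:
        day, url = heapq.heappop(queue)
        visits.append((day, url))
        # Schedule the next visit one interval later.
        heapq.heappush(queue, (day + intervals_in_days[url], url))
    return visits


# Hypothetical intervals: a news page daily, a blog weekly, an archive every 180 days.
intervals = {"/news": 1, "/blog": 7, "/archive": 180}
visits = crawl_order(intervals, horizon_days=7)
print([url for day, url in visits if day == 7])  # ['/blog', '/news']
```

Over the first week the news page is visited eight times, the blog twice and the archive only once.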

How to manage what Google sees on your site

The best way to manage how the Googlebot crawler sees your site is by using Search Console.

Search Console lets you see how Google processes pages on your site and request a recrawl. To opt out of crawling altogether, you can use a file called 'robots.txt', which sits on your own site and tells crawlers which pages they may visit.
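
To show roughly how a well-behaved crawler reads robots.txt, here is a short sketch using Python's standard urllib.robotparser module. The rules and example.com URLs are hypothetical: this sample file keeps Googlebot out of one folder and opts out of every other crawler completely.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, written inline instead of being downloaded.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))     # True
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))  # False
print(parser.can_fetch("OtherBot", "https://example.com/blog/post"))      # False
```

Keep in mind that robots.txt is a request rather than a lock: reputable crawlers follow it, but nothing forces a badly behaved bot to do so.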

The number of Internet pages is extremely large; even the largest crawlers fall short of making a complete index.

So to get your site noticed, it's important to update your site frequently and to take control of submitting changes to Google to ensure your content gets crawled.

This article was written by Gaz Hall, a UK-based SEO Consultant, on 6th April 2009.

Gaz has 20 years' experience working on SEO projects large and small, locally and globally, across a range of sectors.

If you need any SEO advice or would like him to look at your next project, get in touch to arrange a free consultation.