On Bots

Contents


Introduction

In the previous edition - Binary Search Tree 2 - a large scale experiment on search engine behaviour was staged with more than two billion different web pages. This experiment lasted exactly one year, until April 13th. In this period the three major search engines requested more than one million pages of the tree, from more than hundred thousand different URLs. The home page of drunkmenworkhere.org grew from 1.6 kB to over 4 MB due to the visit log and the comment spam displayed there.

This edition presents the results of the experiment.

^


Setup

2,147,483,647 web pages ('nodes') were numbered and arranged in a binary search tree. In such a tree, the branch to the left of each node contains only values less than the node's value, while the right branch contains only values higher than the node's value. So the leftmost node in this tree has value 1 and the rightmost node has value 2,147,483,647.

The depth of the tree is the number of nodes you have to traverse from the root to the most remote leaf. Since you can arrange 2n+1 - 1 numbers in a tree of depth n, the resulting tree has a depth of 30 (231 = 2,147,483,648). The value at the root of the tree is 1073741824 (230).

For each page the traffic of the three major search bots (Yahoo! Slurp, Googlebot and msnbot) was monitored over a period of one year (between 2005-4-13 and 2006-4-13).

To make the content of each page more interesting for the search engines, the value of each node is written out in American English (short scale) and each page request from a search bot is displayed in reversed chronological order. To enrich the zero-content even more, a comment box was added to each page (it was removed on 2006-4-13). These measures were improvements over the initial Binary Search Tree which uses inconvenient long URLs.

Every node shows an image of three trees. Each tree in the image visualises which nodes are crawled by each search engine. Each line in the image represents a node, the number of times a search bot visited the node determines the length of the line. The tree images below are modified large versions of the original image, without the very long root node and with disconnected (wild) branches.

^


Overall results

From the start Yahoo! Slurp was by far the most active search bot. In one year it requested more than one million pages and crawled more than hundred thousand different nodes. Although this is a large number, it still is only 0.0049% of all nodes. The overall statistics of all bots is shown in the table below.

overall statistics by search engine
Yahoo! Google MSN
total number of pageviews 1,030,396 20,633 4,699
number of nodes crawled 105,971 7,556 1,390
percentage of tree crawled 0.0049% 0.00035% 0.000065%
number of indexed nodes 120,000 554 1
indexed/crawled ratio 113.23% 7.33% 0.07%

The growth of the number of pageviews and the number of crawled nodes over the year the experiment lasted, is shown in figure 1 and 2. The way the bots crawled the tree is visualised in detail with the animations for each bot in the sections below.

pageviews in time
Fig. 1 - The cumulative number of pageviews by the search bots in time.


nodes crawled in time
Fig. 2 - The cumulative number of nodes crawled by the search bots in time.


The graph below (fig. 3) shows how many nodes of each level of the tree were crawled by the bots (on a logarithmic scale). The root of the tree is at level 0, while the most remote nodes (e.g. node 1) are at level 30. Since there are 2n nodes at level n (there is only 1 root and there are 230 nodes at level 30) crawling the entire tree would result in a straight line.

nodes crawled by level
Fig. 3 - The number of nodes crawled after 1 year, grouped by node level.


Google closely follows this straight line, until it breaks down after level 12. Most nodes at level 12 or less were crawled (5524 out of 8191), but only very few nodes at higher levels were crawled by Googlebot. MSN shows similar behaviour, but breaks down much earlier, at level 9 (656 out of 1023 nodes were crawled). Yahoo, however, does not break down. At high levels it gradually fails to request all nodes.

The nodes at high levels that were crawled by Yahoo, were requested quite often compared to the other bots: at level 14 to 30 each page was requested 10 times at average (see fig. 4).

nodes crawled by depth
Fig. 4 - The average number of pageviews per node after 1 year, grouped by node level.

^


Yahoo! Slurp

Yahoo! binary tree

Fig. 5 - The Yahoo! Slurp tree.

Yahoo! Slurp was the first search engine to discover Binary Search Tree 2. In the first hours after discovery it crawled the tree vigorously, at a speed of over 2.3 nodes per second (see the short animation). The first day it crawled approximately 30,000 nodes.

In the following month Slurp's activity was low, but after exactly one month it requested all pages it visited before, for the second time. In the animation you can see the size of the tree double on 2005-05-14. This phenomenon is repeated a month later: on 2005-06-13 the tree grows to three times it original size. The number of pageviews is then almost 90,000 while the number of crawled nodes still is 30,000. Figure 6 shows this stepwise increment in the number of pageviews during the first months.

pageviews by Yahoo! Slurp
Fig. 6 - The cumulative number of pageviews by Yahoo! Slurp in time.


After four months Slurp requested a large number of 'new' nodes, for the first time since the initial round. It simply requested all URLs it had. Since it had already indexed 30,000 pages, that each link to two pages at a deeper level, it requested 60,000 pages at the end of August (the number of pageviews jumps from 100,000 to 160,000 pages in fig. 6) and it doubled the number of nodes it had crawled (see the fig. 7).

After 5 months Yahoo! Slurp started requesting nodes more regularly. It still had periods of 'discovery' (e.g. after 10 months).

nodes crawled by Yahoo! Slurp
Fig. 7 - The cumulative number of nodes crawled by Yahoo! Slurp in time.

Yahoo reported 120,000 pages in it's index (current value). This may seem impossible since it only visited 105,971 nodes, but every node is available on two different domain names: www.drunkmenworkhere.org and drunkmenworkhere.org.

Note: the query submitted to Google and MSN yielded 35,600 pages on Yahoo. Yahoo is the only search engine that returns results with the query used above.

^


Googlebot

Google binary tree

Fig. 8 - The Googlebot tree.

In comparison with Yahoo's tree, Google's tree looks more like a natural tree. This is because Google visited nodes at deeper levels less frequently than their parent nodes. Yahoo only visited the nodes at the first three levels more frequently, while Google did so for the first 12 levels (see fig. 4).

The form of the tree follows from Google's PageRank algorithm. PageRank is defined as follows:

"We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) "

Since most nodes in the tree are not linked to by other sites, the PageRank of a node can be calculated with this formula (ignoring links in the comments):

PR(node) = 0.15 + 0.85 (PR(parent) + PR(left child) + PR(right child))/3

The only unknown when applying this formula iteratively, is the PageRank of the root node of the tree. Since this node was the homepage of drunkmenworkhere.org for a year, a high rank may be assumed. The calculated PageRank tree (fig. 9) shows similar proportions as Googlebot's real tree, so the frequency of visiting a page seems to be related to the PageRank of a page.

PageRank of pages in a binary tree
Fig. 9 - A binary tree of depth 17 visualising calculated PageRank as length of each line, when the PageRank of the root node is set to 100.


The animation of the Googlebot tree shows some interesting erratic behaviour, that cannot be explained with PageRank.

The rightmost branch
From the start Googlebot crawled more nodes on the right hand side of the tree. On 2005-07-04 it tries to visit the rightmost node, i.e. the node with the highest value. After selecting the right branch starting at the root for 20 levels Googlebot stopped. This produced the arc at the right end of the tree. Googlebot's rightmost branch
Searching node 1
On 2005-06-30 Googlebot visited node 1, the leftmost node. It did not crawl the path from the root to this node, so how did it find the page? Did it guess the URL or did it follow some external link?
A few hours later, Googlebot crawled node 2, which is linked as a parent node by node 1. These two nodes are displayed as a tiny dot in the animation on 2005-06-30, floating above the left branch. Then, a week later, on 2005-07-06 (two days after the attempt to find rightmost node), between 06:39:39 and 06:39:59 Googlebot finds the path to these disconnected nodes by visiting the 24 missing nodes in 20 seconds. It started at the root and found it's way up to node 2, without selecting a right branch. In the large version of the Googlebot tree, this path is clearly visible. The nodes halfway the path were not requested for a second time and are represented by thin short line segments, hence the steep curve.
Googlebot's path to node 1
Yahoo-like subtree
On 2005-07-23 Google suddenly spends some hours crawling 600 new nodes near node 1073872896. Most of these nodes were not visited ever again.
This subtree is the reason the number of nodes crawled by Googlebot, grouped by level, increases again from level 18 to level 30 in fig. 3.
Googlebot's subtree

Over the last six months Googlebot requested pages at a fixed rate (about 260 pages per month, fig. 10). Like Yahoo! Slurp it seems to alternate between periods of discovery (see fig. 11) and periods of refreshing it's cache.

pageviews by Googlebot

Fig. 10 - The cumulative number of pageviews by Googlebot in time.


nodes crawled by Googlebot
Fig. 11 - The cumulative number of nodes crawled by Googlebot in time.


Google returned 554 results when searching for nodes. The first nodes reported by Google are node 1 and 2, which are very deep inside the tree at level 29 and 30. Their higher rank is also reflected in the curve shown above (Searching node 1), which indicates a high number of pageviews. They probably appear first because of their short URLs. The other nodes at the first result page are all at level 4, probably because the first three levels are penalised because of comment spam. The current number of results can be checked here.

^


MSNbot

MSN binary tree

Fig. 12 - The msnbot tree


The Msnbot tree is much smaller than Yahoo's and Google's. The most interesting feature is the disconnected large branch to the right of the tree. It appears on 2005-04-29, when msnbot visits node 2045877824. This node contains one comment, posted two weeks before:

I hereby claim this name in the name of...well, mine. Paul Pigg.

A week before msnbot requested this node, Googlebot already visited this node. This random node at level 24 was crawled because of a link from Paul Pigg's website masterpigg.com (now dead, Google cache). All three search engines visited the node via this link, and all three failed to connect it to the rest of the tree. You can check this by clicking the 'to trunk' links starting at node 2045877824.

Msnbot crawled from the disconnected node in upward and downward direction, creating a large subtree. This subtree caused the upward line between level 18 and 30 in figure 3.

The second large disconnected branch, at the top in the middle, originated from a link on uu-dot.com. Both disconnected branches are clearly visible in the Googlebot tree as well.

pageviews by msnbot

Fig. 13 - The cumulative number of pageviews by msnbot in time.


nodes crawled by msnbot
Fig. 14 - The cumulative number of nodes crawled by msnbot in time.


As the graphs above show, msnbot virtually ceased to crawl Binary Search Tree 2 after five months. How the number of results MSN Search returns, relates to the above graphs is unclear.

^


Spam bots

In one year 5265 comments were posted to 103 different nodes. 32 of these nodes were never visited by any of the search bots. Most comments (3652) were posted to the root node (the home page). The word frequency of the submitted comments was calculated.

top 50 of most frequently spammed words
countword
1 32743http
2 23264com
3 12375url
4 8636www
5 5541info
6 4631viagra
7 4570online
8 4533phentermine
9 4512buy
10 4469html
11 3531org
12 3346blogstudio
13 3194drunkmenworkhere
14 2801free
15 2772cialis
16 2371to
17 2241u
18 2169generic
19 2054cheap
20 1921ringtones
21 1914view
22 1835a
23 1818net
24 1756the
25 1658buddy4u
26 1633of
27 1633lelefa
28 1580xanax
29 1572blogspot
30 1570tramadol
31 1488mp3sa
32 1390insurance
33 1379poker
34 1310cgi
35 1232sex
36 1198teen
37 1193in
38 1158content
39 1105aol
40 1099mime
41 1095and
42 1081home
43 1034us
44 1022valium
45 1020josm
46 1012order
47 992is
48 948de
49 908ringtone
50 907i

complete list (360 kB)

As the top 50 clearly shows, most spam was related to pharmaceutical products. The pie chart below shows the share of each medicine.

spammed pharmaceuticals
Fig. 15 - The share of various medicines in comment spam.


Submitted domain names were filtered from the text. All top-level domain names are shown in figure 16, ordered by frequency.

spam by tld

Fig. 16 - Number of spammed domains by top level domain


Many email addressses submitted by the spam bots were non-existing addresses @drunkmenworkhere.org, which explains the high rank of this domain in the chart of most frequently spammed domains (fig. 17).

spam by tld

Fig. 17 - Most frequently spammed domains


^