Which web search engines have their own index? Evidence from website visits
by Stephen Hewitt
This article presents statistics for web page requests for cambridgeclarion.org where the User Agent string purported to be that of a search engine spider (also called a crawler or bot). Such statistics provide some evidence of which search engines are maintaining an independent index using their own crawler and the level of activity of that crawler.
These statistics cover a two-month period ending on 9 September 2021 and do not include requests for image files.
IP addresses were not logged or checked in this exercise, so there is no proof that the declared User Agent string is genuine. However, the absence of a request with an official User Agent string can demonstrate a negative, namely that a crawler from a particular search engine did not make a page request. For example, no User Agent string that could be associated with Brave was seen, and neither was one for Gigablast, during the entire period.
In general, these statistics put a reliable upper bound on the number of page requests from the official crawler of each search engine, but not a lower bound, because some of the counted requests may have carried a forged User Agent string.
Table 1 shows the number of page requests associated with each search engine. A page is defined here to mean an HTML page, a PDF file or a plain text file, so requests for images and anything else embedded in a page were not counted. The counts also exclude requests for favicon.ico, robots.txt and the non-existent ads.txt.
Search engine | Number of requests | Requests relative to Googlebot | Unique pages requested | Unique pages relative to Googlebot |
---|---|---|---|---|
Baidu | 159 | 5.80% | 73 | 17.98% |
Bing | 4,188 | 152.79% | 816 | 200.99% |
Coccoc | 83 | 3.03% | 2 | 0.49% |
Google | 2,741 | 100% | 406 | 100% |
Infotiger | 261 | 9.52% | 257 | 63.30% |
Marginalia | 1,344 | 49.03% | 741 | 182.51% |
Mojeek | 157 | 5.73% | 149 | 36.70% |
Naver | 21 | 0.77% | 15 | 3.69% |
Neeva | 25 | 0.91% | 21 | 5.17% |
Petal | 4,959 | 180.92% | 899 | 221.43% |
Qwant | 323 | 11.78% | 313 | 77.09% |
Seznam | 678 | 24.74% | 347 | 85.47% |
Sogou | 384 | 14.01% | 14 | 3.45% |
Yandex | 1,101 | 40.17% | 672 | 165.52% |
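
The counting rules behind Table 1 can be sketched in a few lines of Python. This is an illustration only, not the script used for the article: the function names, the input format of (search engine, path) pairs and the list of file suffixes treated as pages are all assumptions.

```python
from collections import defaultdict

# Assumption: pages are recognised by suffix. The article only says that
# HTML pages, PDF files and plain text files count as pages.
PAGE_SUFFIXES = (".html", ".htm", ".pdf", ".txt")

# Paths the article says were excluded from the counts.
EXCLUDED_PATHS = {"/favicon.ico", "/robots.txt", "/ads.txt"}


def is_page(path):
    """Return True if a request path counts as a page request."""
    if path in EXCLUDED_PATHS:
        return False
    # Assumption: a path ending in "/" serves an HTML index page.
    return path.endswith("/") or path.lower().endswith(PAGE_SUFFIXES)


def tally(requests):
    """Map each engine to (number of requests, unique pages requested).

    `requests` is an iterable of (engine, path) pairs (hypothetical format).
    """
    counts = defaultdict(int)
    unique = defaultdict(set)
    for engine, path in requests:
        if is_page(path):
            counts[engine] += 1
            unique[engine].add(path)
    return {engine: (counts[engine], len(unique[engine])) for engine in counts}


def relative_to(value, baseline):
    """Format value as a percentage of baseline, to two decimal places."""
    return f"{100 * value / baseline:.2f}%"
```

With the figures from Table 1, `relative_to(159, 2741)` gives `5.80%` for Baidu's requests relative to Googlebot, and `relative_to(816, 406)` gives `200.99%` for Bing's unique pages.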
Table 2 shows the observed User Agent strings (truncated to a maximum of 189 characters) that contributed to the counts in Table 1.
Some User Agent strings seen in requests do not appear in Table 2 because they did not contribute to the counts. This was the case, for example, when only requests for image files and robots.txt were received with a particular User Agent string. One string has been omitted on the grounds that it was likely fake. It does not match any of the Googlebot strings as presented at https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers in May 2022. None of those included “iPhone”.
This string, used in a total of 18 requests, is as follows:
Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bo
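
The criterion applied to this string, a Googlebot token combined with "iPhone", can be written as a short check. This is a minimal sketch with a hypothetical function name. It encodes only the one observation made above and is not a general method for detecting forged User Agent strings.

```python
def is_suspect_googlebot(user_agent):
    """Flag a string that claims to be Googlebot but also mentions iPhone.

    Assumption: the comparison is case-insensitive.
    """
    ua = user_agent.lower()
    return "googlebot" in ua and "iphone" in ua


# The omitted string, exactly as truncated in the article.
omitted = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) "
    "AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e "
    "Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bo"
)
# A Googlebot string that was counted.
counted = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(is_suspect_googlebot(omitted))  # True
print(is_suspect_googlebot(counted))  # False
```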
Search engine | Observed User Agent string |
---|---|
Baidu | Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) |
Baidu | Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html) |
Baidu | Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) |
Bing | Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm) |
Bing | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) |
Bing | Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbo |
Coccoc | Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine) |
Google | Googlebot-Image/1.0 |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.119 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.80 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.83 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.85 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.89 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
Google | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
Infotiger | Mozilla/5.0 (compatible; InfoTigerBot/1.9; +https://infotiger.com/bot) |
Marginalia | search.marginalia.nu |
Mojeek | Mozilla/5.0 (compatible; MojeekBot/0.10; +https://www.mojeek.com/bot.html) |
Naver | Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd) |
Neeva | Mozilla/5.0 (compatible; Neevabot/1.0; +https://neeva.com/neevabot) |
Petal | Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) |
Petal | Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot) |
Qwant | Mozilla/5.0 (compatible; Qwantify/1.0; +https://www.qwant.com/) |
Qwant | Mozilla/5.0 (compatible; Qwantify/2.4w; +https://www.qwant.com/) |
Seznam | Mozilla/5.0 (compatible; SeznamBot/3.2-test1; +http://napoveda.seznam.cz/en/seznambot-intro/) |
Seznam | Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/) |
Seznam | Mozilla/5.0 (compatible; SeznamBot/4.0-RC1; +http://napoveda.seznam.cz/seznambot-intro/) |
Sogou | Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07) |
Yandex | Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.268 |
Yandex | Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) |
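
Associating a User Agent string with a search engine, as in Table 2, could be done with a simple substring match. The sketch below is one possible approach, not necessarily the one used for this article. The tokens are taken from the strings observed in Table 2, while the function name and the choice of case-insensitive matching are assumptions.

```python
# One identifying token per engine, lower-cased, drawn from Table 2.
ENGINE_TOKENS = [
    ("Baidu", "baiduspider"),
    ("Bing", "bingbot"),
    ("Coccoc", "coccocbot"),
    ("Google", "googlebot"),
    ("Infotiger", "infotigerbot"),
    ("Marginalia", "search.marginalia.nu"),
    ("Mojeek", "mojeekbot"),
    ("Naver", "yeti/"),
    ("Neeva", "neevabot"),
    ("Petal", "petalbot"),
    ("Qwant", "qwantify"),
    ("Seznam", "seznambot"),
    ("Sogou", "sogou web spider"),
    ("Yandex", "yandexbot"),
]


def engine_for(user_agent):
    """Return the engine whose token appears in the string, or None.

    A match shows only what the string claims; without checking IP
    addresses it does not prove that the request came from that engine.
    """
    ua = user_agent.lower()
    for engine, token in ENGINE_TOKENS:
        if token in ua:
            return engine
    return None
```

Matching case-insensitively lets both `Bingbot/2.0` and `bingbot/2.0` from Table 2 resolve to Bing. A string that matches no token returns `None`, which corresponds to the case described earlier where no string attributable to Brave or Gigablast was seen.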
Appendix
Table 3 shows the public web search page for some of these search engines.
Search engine | Public search page |
---|---|
Infotiger | https://infotiger.com |
Marginalia | https://search.marginalia.nu |
Mojeek | https://www.mojeek.co.uk |
Seznam | https://search.seznam.cz |