Which web search engines have their own index? Evidence from website visits


This article presents statistics for web page requests for cambridgeclarion.org where the User Agent string purported to be that of a search engine spider (also called a crawler or bot). Such statistics provide some evidence of which search engines are maintaining an independent index using their own crawler and the level of activity of that crawler.

These statistics are for a two-month period ending on 9 November 2021 and do not include requests for image files.

IP addresses were not logged or checked in this exercise, so there is no proof that the declared User Agent string is genuine. However, the absence of a request with an official User Agent string can demonstrate a negative, namely that a crawler from a particular search engine did not make a page request. For example, no User Agent string that could be associated with Brave was seen, and neither was one for Gigablast, during the entire period.

In general, these statistics put a reliable upper bound on the number of page requests from the official crawler of each search engine, because requests with a forged User Agent string can only inflate the counts. They do not give a reliable lower bound.

Table 1 shows the number of page requests associated with each search engine. A page is defined here as an HTML page, a PDF file or a plain text file, so requests for images and anything else embedded in a page were not counted. The counts also exclude requests for favicon.ico, robots.txt and the non-existent ads.txt.
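
The counting rules above can be sketched in Python. This is only an illustration under stated assumptions, not the script used to produce Table 1: the token-to-engine mapping, the file extensions taken to identify a page, and the `(path, user_agent)` record format are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical mapping from a User Agent substring to a search engine name.
CRAWLER_TOKENS = {
    "googlebot": "Google",
    "bingbot": "Bing",
    "mojeekbot": "Mojeek",
}

# Paths excluded from the counts, as listed in the text.
EXCLUDED_PATHS = {"/favicon.ico", "/robots.txt", "/ads.txt"}

# Assumed extensions for HTML, PDF and plain text pages. Directory-style
# URLs ending in "/" are also assumed to be HTML pages.
PAGE_EXTENSIONS = (".html", ".htm", ".pdf", ".txt")


def is_page(url):
    """Return True if the request target looks like a countable page."""
    path = urlsplit(url).path
    if path in EXCLUDED_PATHS:
        return False
    return path.endswith("/") or path.lower().endswith(PAGE_EXTENSIONS)


def engine_for(user_agent):
    """Return the engine whose token appears in the User Agent, or None."""
    ua = user_agent.lower()
    for token, engine in CRAWLER_TOKENS.items():
        if token in ua:
            return engine
    return None


def count_requests(records):
    """Count page requests and unique pages per engine.

    records is an iterable of (url, user_agent) pairs.
    Returns {engine: (number_of_requests, number_of_unique_pages)}.
    """
    totals = defaultdict(int)
    unique = defaultdict(set)
    for url, user_agent in records:
        engine = engine_for(user_agent)
        if engine is None or not is_page(url):
            continue
        totals[engine] += 1
        unique[engine].add(urlsplit(url).path)
    return {engine: (totals[engine], len(unique[engine])) for engine in totals}
```

Matching on a substring of the declared User Agent mirrors the limitation already noted: it identifies what a request claims to be, not what it is.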

Table 1: The number of requests for web pages of cambridgeclarion.org made by the crawlers from various search engines as identified by User Agent string in a period of 62 days ending 9 November 2021

| Search engine | Number of requests | Requests relative to Googlebot | Unique pages requested | Unique pages relative to Googlebot |
| --- | --- | --- | --- | --- |
| Baidu | 159 | 5.80% | 73 | 17.98% |
| Bing | 4,188 | 152.79% | 816 | 200.99% |
| Coccoc | 83 | 3.03% | 2 | 0.49% |
| Google | 2,741 | 100% | 406 | 100% |
| Infotiger | 261 | 9.52% | 257 | 63.30% |
| Marginalia | 1,344 | 49.03% | 741 | 182.51% |
| Mojeek | 157 | 5.73% | 149 | 36.70% |
| Naver | 21 | 0.77% | 15 | 3.69% |
| Neeva | 25 | 0.91% | 21 | 5.17% |
| Petal | 4,959 | 180.92% | 899 | 221.43% |
| Qwant | 323 | 11.78% | 313 | 77.09% |
| Seznam | 678 | 24.74% | 347 | 85.47% |
| Sogou | 384 | 14.01% | 14 | 3.45% |
| Yandex | 1,101 | 40.17% | 672 | 165.52% |
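
The "relative to Googlebot" columns are each count divided by the corresponding Googlebot count, expressed as a percentage. A short sketch of that arithmetic, using the Bing and Mojeek figures from Table 1 (the function name is hypothetical):

```python
def relative_to_baseline(value, baseline):
    """Return value as a percentage of baseline, rounded to two decimals."""
    return round(100 * value / baseline, 2)


# Googlebot figures from Table 1, used as the baseline.
GOOGLEBOT_REQUESTS = 2741
GOOGLEBOT_UNIQUE_PAGES = 406

print(relative_to_baseline(4188, GOOGLEBOT_REQUESTS))     # Bing requests: 152.79
print(relative_to_baseline(816, GOOGLEBOT_UNIQUE_PAGES))  # Bing unique pages: 200.99
print(relative_to_baseline(157, GOOGLEBOT_REQUESTS))      # Mojeek requests: 5.73
```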

Table 2 shows the observed User Agent strings (truncated to a maximum of 189 characters) that contributed to the counts in Table 1.

Some User Agent strings seen in requests do not appear in Table 2 because they did not contribute to the counts. This was the case, for example, when only requests for image files and robots.txt were received with a particular User Agent string. One string has been omitted on the grounds that it was likely fake. It does not match any of the Googlebot strings as presented at https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers in May 2022. None of those included “iPhone”.

This string, used in a total of 18 requests, is as follows:

Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bo
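
The reasoning used to discard this string can be expressed as a simple heuristic: a User Agent that claims to be Googlebot but also contains "iPhone" is treated as suspect. The sketch below is an illustration only, and the function name is hypothetical. Passing the check does not show that a string is genuine, since IP addresses were not checked.

```python
def looks_like_fake_googlebot(user_agent):
    """Flag a User Agent that claims Googlebot but mentions iPhone.

    Based solely on the observation in the text that none of the
    documented Googlebot strings included "iPhone".
    """
    ua = user_agent.lower()
    return "googlebot" in ua and "iphone" in ua


suspect = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) "
    "AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 "
    "Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bo"
)
plain = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(looks_like_fake_googlebot(suspect))  # True
print(looks_like_fake_googlebot(plain))    # False
```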

Table 2: User Agent strings that were seen in requests for pages at cambridgeclarion.org and contributed to the counts in Table 1.

| Search engine | Observed User Agent string |
| --- | --- |
| Baidu | Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) |
| Baidu | Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html) |
| Baidu | Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) |
| Bing | Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm) |
| Bing | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) |
| Bing | Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbo |
| Coccoc | Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine) |
| Google | Googlebot-Image/1.0 |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.119 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.80 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.83 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.85 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.89 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/ |
| Google | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
| Infotiger | Mozilla/5.0 (compatible; InfoTigerBot/1.9; +https://infotiger.com/bot) |
| Marginalia | search.marginalia.nu |
| Mojeek | Mozilla/5.0 (compatible; MojeekBot/0.10; +https://www.mojeek.com/bot.html) |
| Naver | Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd) |
| Neeva | Mozilla/5.0 (compatible; Neevabot/1.0; +https://neeva.com/neevabot) |
| Petal | Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) |
| Petal | Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot) |
| Qwant | Mozilla/5.0 (compatible; Qwantify/1.0; +https://www.qwant.com/) |
| Qwant | Mozilla/5.0 (compatible; Qwantify/2.4w; +https://www.qwant.com/) |
| Seznam | Mozilla/5.0 (compatible; SeznamBot/3.2-test1; +http://napoveda.seznam.cz/en/seznambot-intro/) |
| Seznam | Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/) |
| Seznam | Mozilla/5.0 (compatible; SeznamBot/4.0-RC1; +http://napoveda.seznam.cz/seznambot-intro/) |
| Sogou | Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07) |
| Yandex | Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.268 |
| Yandex | Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) |

Appendix

Table 3 shows the public web search page for some of these search engines.

Table 3: Public web search pages available in September 2022 from some of the search engines in Table 1.

| Search engine | Public search page |
| --- | --- |
| Infotiger | https://infotiger.com |
| Marginalia | https://search.marginalia.nu |
| Mojeek | https://www.mojeek.co.uk |
| Seznam | https://search.seznam.cz |
