Analysis of the Temporal Behaviour of Search Engine Crawlers at Web Sites

Jeeva Jose, P. Sojan Lal

Abstract


Web log mining is the extraction of information from web logs to analyze user behaviour at web sites. In addition to user information, web logs provide extensive information about search engine traffic and behaviour. Search engine crawlers are highly automated programs that periodically visit a web site to collect information. The behaviour of search engines can be used to analyze server load, the quality of search engines, the dynamics of search engine crawlers, and the ethics of search engines. The time spent by various crawlers is significant in identifying the server load, as a major proportion of the server load is constituted by search engine crawlers. A temporal analysis of the search engine crawlers was carried out to identify their behaviour. It was found that there is a significant difference in the total time spent by various crawlers. The hourly presence of search engine crawlers at web sites was also analyzed to identify the dynamics of search engine crawlers at web sites.
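The two measurements named in the abstract, hourly crawler presence and total time spent per crawler, can be illustrated with a minimal Python sketch. It is not the authors' procedure, which the abstract does not describe. It assumes the server writes the Apache Combined Log Format, identifies crawlers by hypothetical User-Agent substrings (`CRAWLER_MARKERS`), and approximates time spent as the sum of gaps between consecutive requests that fall within an assumed 30-minute idle cutoff (`SESSION_GAP`). The sample log lines are invented for illustration.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Hypothetical crawler markers, matched case-insensitively in the User-Agent.
CRAWLER_MARKERS = {
    "Googlebot": "googlebot",
    "Bingbot": "bingbot",
    "Yahoo Slurp": "slurp",
}

# Assumed input: Apache Combined Log Format.
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Assumed idle cutoff: longer gaps are treated as time away from the site.
SESSION_GAP = timedelta(minutes=30)


def parse_line(line):
    """Return (crawler_name, timestamp) for a crawler request, else None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    user_agent = match.group("ua").lower()
    for name, marker in CRAWLER_MARKERS.items():
        if marker in user_agent:
            stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            return name, stamp
    return None


def temporal_profile(lines):
    """Return (requests per hour of day, estimated seconds spent) per crawler."""
    hourly = defaultdict(Counter)
    visits = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        name, stamp = parsed
        hourly[name][stamp.hour] += 1
        visits[name].append(stamp)

    time_spent = {}
    for name, stamps in visits.items():
        stamps.sort()
        total = timedelta()
        # Sum only the gaps short enough to count as continuous activity.
        for previous, current in zip(stamps, stamps[1:]):
            gap = current - previous
            if gap <= SESSION_GAP:
                total += gap
        time_spent[name] = total.total_seconds()
    return hourly, time_spent


# Invented sample lines, for illustration only.
sample_log = [
    '66.249.66.1 - - [10/Mar/2013:06:15:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Mar/2013:06:15:42 +0000] "GET /a.html HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '157.55.39.1 - - [10/Mar/2013:06:20:00 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '10.0.0.7 - - [10/Mar/2013:06:21:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/19.0"',
]

hourly_counts, seconds_spent = temporal_profile(sample_log)
for crawler in sorted(hourly_counts):
    print(crawler, dict(hourly_counts[crawler]), seconds_spent[crawler])
```

The per-crawler totals produced this way could then be compared across crawlers with a statistical test of the analyst's choice. The idle cutoff strongly affects the estimate and would need to be tuned to the site being studied.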


DOI: http://dx.doi.org/10.6084/ijact.v2i6.375
