Having web spiders crawl a website is generally a good thing. However, when
it comes to data analysis, a large number of robot visits can make the data invalid.
If someone is using raw data for data analysis, the web spider visits should be excluded.
The site http://www.iplists.com provides a list of IP addresses of search engine spiders.
While this list is big and includes most robots, there are still some that are not included.
One way to detect a new robot is to look at the number of pageviews for each IP.
In many situations, if a robot visits the website regularly and crawls most of the pages, this number
will be much higher than the number of pageviews per IP for a human visitor.
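
A rough sketch of this heuristic, written in Python for illustration. The function name, the use of the median as the "typical human" baseline, and the factor of 10 are all assumptions, not fixed rules; a real site would need to tune them against its own traffic.

```python
from statistics import median

def flag_suspected_robots(pageviews_per_ip, factor=10):
    """Return (ip, pageviews) pairs whose count is far above the typical value.

    pageviews_per_ip: dict mapping an IP string to its pageview count.
    factor: hypothetical multiplier; an IP is flagged when its count
            exceeds factor * median of all counts.
    """
    if not pageviews_per_ip:
        return []
    typical = median(pageviews_per_ip.values())
    threshold = typical * factor
    suspects = [(ip, n) for ip, n in pageviews_per_ip.items() if n > threshold]
    # Highest counts first, so the most likely robots are at the top.
    return sorted(suspects, key=lambda item: item[1], reverse=True)
```

An IP flagged this way is only a candidate. It still has to be checked by hand (for example, by looking at its user agent) before it is added to the exclusion list.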
So I created a Perl script that iterates through the log, excludes already known search engine spiders and my own IPs (listed in a special text file), and
counts the number of visits per IP. It also counts the number of views per page.
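
The counting logic can be sketched as follows. This is a Python illustration of the same idea, not the Perl script itself. It assumes a made-up log layout in which each line is whitespace-separated, with the IP as the first field and the requested page as the second, and an exclusion file with one IP per line. The function names are hypothetical.

```python
from collections import Counter

def load_excluded_ips(lines):
    """Build the set of IPs to skip (known spiders and own IPs).

    Assumes one IP per line; blank lines and lines starting with '#' are ignored.
    """
    excluded = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            excluded.add(line)
    return excluded

def count_pageviews(log_lines, excluded_ips):
    """Count pageviews per IP and per page, skipping excluded IPs.

    Assumed log layout: '<ip> <page> ...' separated by whitespace.
    """
    per_ip = Counter()
    per_page = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip empty or malformed lines
        ip, page = fields[0], fields[1]
        if ip in excluded_ips:
            continue
        per_ip[ip] += 1
        per_page[page] += 1
    return per_ip, per_page
```

With real files, both functions can be given open file handles, since they only iterate over lines. `per_ip.most_common(20)` then lists the busiest IPs first, which is where new robots are most likely to show up.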
The format of the web log is not server-defined.
Having a summary of pageviews per IP makes it possible to detect and then exclude new robots. This
will make the data analysis more useful and accurate.