Applied Math & Computer Science Lab
Data Analysis, Optimization & Mathematical Modeling, Artificial Intelligence, Neural Net For Everyday Life Applications


Extracting Useful Text From a Website

  Any web page contains useful text (its main content) alongside additional elements such as navigation, advertisements, footers, headers and so on. Separating the useful text can be very helpful in automatic web page processing or web mining.
Given several web pages from the same website, this task can be accomplished by comparing the pages. Useful text has two features that can be used for separation: it is located in one place (possibly more than one), and it is unique to each page.
So the frequency of words within the useful text, counted across all pages of a given website, should be very small, since the text on each page is different. Of course, common words like "the", "in" and "a" will have a high occurrence, but this can be resolved, for example by counting the frequency of sequences of 2 or 3 words, that is, by taking the preceding words into account.
Here is the Perl script for this.
Thus this method and the included Perl script make it possible to extract the useful text from the pages of a website. As a prefiltering step, this will also improve the quality of further processing.
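
The frequency idea described above can be illustrated with a small sketch. It is written in Python and is not the Perl script listed under Source Code; it only shows one possible reading of the method. It assumes the HTML has already been stripped to plain text with one block of text per line. The function names and the thresholds `max_pages` and `min_rare_share` are hypothetical choices made for illustration.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Return the lowercase word n-grams of `text` as a list of tuples."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def page_frequency(pages, n=2):
    """For every n-gram, count on how many pages it occurs at least once."""
    counts = Counter()
    for page in pages:
        counts.update(set(ngrams(page, n)))
    return counts


def extract_useful_text(pages, n=2, max_pages=1, min_rare_share=0.5):
    """Keep the lines of each page whose n-grams are mostly rare site-wide.

    An n-gram counts as rare when it occurs on at most `max_pages` pages.
    A line is kept when at least `min_rare_share` of its n-grams are rare.
    Both thresholds are assumptions and would need tuning on a real site.
    """
    freq = page_frequency(pages, n)
    results = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            grams = ngrams(line, n)
            if not grams:
                continue  # fewer than n words: too short to judge, so dropped
            rare = sum(1 for g in grams if freq[g] <= max_pages)
            if rare / len(grams) >= min_rare_share:
                kept.append(line.strip())
        results.append("\n".join(kept))
    return results


# Made-up sample pages: shared navigation and footer, unique middle line.
sample_pages = [
    "Home Products Contact Us\n"
    "Perl regular expressions match patterns in strings.\n"
    "Copyright Example Lab",
    "Home Products Contact Us\n"
    "Neural networks learn weights from training data.\n"
    "Copyright Example Lab",
    "Home Products Contact Us\n"
    "Clickstream logs record the pages a visitor opens.\n"
    "Copyright Example Lab",
]

for useful in extract_useful_text(sample_pages):
    print(useful)
```

Counting word pairs instead of single words is what keeps common words such as "the" or "in" from marking unique text as repeated: the pair "patterns in" occurs on one page only, even though "in" alone may occur on every page.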



Source Code

1. Perl script for extracting useful text from a website