Applied Math & Computer Science Lab
Data Analysis, Optimization & Mathematical Modeling, Artificial Intelligence, Neural Net For Everyday Life Applications


Extracting Useful Text From Website

  Any web page contains useful text (the content) alongside additional elements such as navigation, advertisements, the header, and the footer. Separating the useful text from these elements can be very helpful in automatic web page processing and web mining.
Given several web pages from the same website, this task can be accomplished by comparing the pages with each other. Useful text has two features that can be used for the separation: it is located in one place on the page (possibly more than one), and it is unique to each page.
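The comparison idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the Perl script referred to in this article: each page is assumed to be already converted to plain text with one block per line, and the function name `unique_lines` is made up for this example.

```python
from collections import Counter

def unique_lines(pages):
    """For each page, keep the lines that appear on no other page."""
    # Count in how many pages each non-empty, stripped line occurs.
    page_count = Counter()
    for text in pages:
        page_count.update({ln.strip() for ln in text.splitlines() if ln.strip()})

    result = []
    for text in pages:
        # A line seen on exactly one page is treated as useful text.
        kept = [ln.strip() for ln in text.splitlines()
                if ln.strip() and page_count[ln.strip()] == 1]
        result.append(kept)
    return result

# Hypothetical sample pages: shared navigation and footer, different content.
pages = [
    "Home | About | Contact\nPerl tips for parsing HTML\nCopyright Example Site",
    "Home | About | Contact\nNotes on neural networks\nCopyright Example Site",
]
print(unique_lines(pages))
```

Exact line matching is the crudest possible comparison; it breaks as soon as the shared elements differ slightly between pages, which is one reason to count words or word sequences instead.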
Because the useful text is different on each page, the words in it should occur on only a few pages of a given website, while the words in navigation, headers, and footers repeat on almost every page. Of course, common words such as "the", "in", and "a" will have a high occurrence everywhere, but this can be resolved, for example by counting the frequency of two- or three-word sequences, so that each word is considered together with the text preceding it.
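A rough Python sketch of the word-sequence counting follows. It is an assumption-laden illustration rather than the author's Perl implementation: the helper names (`ngrams`, `ngram_page_counts`, `rare_fraction`), the tokenizing regular expression, and the choice of two-word sequences are all picked for the example.

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Split text into lowercase words and return its n-word sequences."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_page_counts(pages, n=2):
    """Count on how many pages each n-word sequence occurs."""
    counts = Counter()
    for text in pages:
        counts.update(set(ngrams(text, n)))  # set(): count each page once
    return counts

def rare_fraction(block, counts, n=2, max_pages=1):
    """Fraction of the block's n-word sequences seen on at most max_pages pages."""
    grams = ngrams(block, n)
    if not grams:
        return 0.0
    rare = sum(1 for g in grams if counts[g] <= max_pages)
    return rare / len(grams)

# Hypothetical sample pages sharing a navigation bar.
pages = [
    "home about contact the cat sat in the garden",
    "home about contact the dog ran in the park",
]
counts = ngram_page_counts(pages)
print(rare_fraction("home about contact", counts))         # navigation block
print(rare_fraction("the cat sat in the garden", counts))  # content block
```

The single words "the" and "in" appear on both sample pages, but sequences such as "the cat" or "the garden" appear on only one, so the content block still gets a high score while the navigation block gets a low one. A block would then be kept when its score exceeds some threshold, which has to be tuned per site.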
A Perl script implementing this method is listed under Source Code below.
Thus, this method and the included Perl script make it possible to extract the useful text from the pages of a website. Used as a prefiltering step, this extraction should also improve the quality of further processing.



Source Code

1. Perl script for extracting useful text from web site