Any web page has useful text/content and additional elements such as navigation, advertisements, footers, headers and so on.
Separating the useful text can be very helpful in automatic web page processing or web mining.
Given several web pages from the same website, this task can be accomplished
by comparing the pages. Useful text has two features that can be
used for separation: it is located in one place (possibly more than one) and it is unique to each
page.
So the frequency of words within the useful text, counted across all pages from a given website, should
be very small, since the text on each page is different. Of course, common words like
"the", "in" and "a" will have a high occurrence, but this can be resolved,
for example by counting the frequency of sequences of 2 or 3 words, that is, looking at each word together with the text that precedes it.
Here is the Perl script for this.
Thus this method and the included Perl script make it possible to extract useful text from the pages of a website.
As a prefiltering step, this will also improve the quality of further processing.
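
The comparison described above can be sketched in Python as follows. This is only a minimal illustration of the idea, not the Perl script itself. It assumes each page has already been reduced to a list of text lines with the HTML removed. The function names, the two thresholds and the sample pages are all made up for the example.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Return the lowercase word n-grams of a text line as a list of tuples."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def page_frequency(pages, n=2):
    """Count, for every n-gram, the number of pages it appears on."""
    counts = Counter()
    for lines in pages:
        seen = set()
        for line in lines:
            seen.update(ngrams(line, n))
        counts.update(seen)  # each page contributes at most 1 per n-gram
    return counts


def useful_lines(lines, counts, num_pages, n=2,
                 max_page_share=0.5, max_common_share=0.5):
    """Keep lines whose n-grams mostly occur on few pages of the site.

    Both thresholds are assumed values chosen for this illustration.
    """
    kept = []
    for line in lines:
        grams = ngrams(line, n)
        if not grams:
            continue  # too short to judge; treated as non-content here
        # An n-gram is "common" if it appears on a large share of the pages.
        common = sum(1 for g in grams if counts[g] / num_pages > max_page_share)
        if common / len(grams) <= max_common_share:
            kept.append(line)
    return kept


# Made-up sample: three pages sharing the same navigation and footer lines.
pages = [
    ["Home | About us | Contact",
     "Perl regular expressions are powerful tools.",
     "Copyright 2010 Example Site"],
    ["Home | About us | Contact",
     "Neural networks learn from training data.",
     "Copyright 2010 Example Site"],
    ["Home | About us | Contact",
     "Web mining extracts knowledge from pages.",
     "Copyright 2010 Example Site"],
]

counts = page_frequency(pages)
extracted = [useful_lines(p, counts, len(pages)) for p in pages]
for lines in extracted:
    print(lines)
```

Counting word pairs instead of single words means that a common word such as "the" does not make a line look repeated on its own; only a pair that recurs across pages counts against it. With only a handful of pages the thresholds are crude, so they would likely need tuning on a real site.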