The Pre-processing Steps for Text Document Mining with Perl
Introduction
In previous work [1],[3], a Perl script for document clustering was created, but no pre-processing steps were
considered. This work investigates some pre-processing steps that can easily be
added to the text mining process.
Stemming
The purpose of stemming is to reduce multiple forms of a word to a common form.
Perl has the module Lingua::Stem for this. Sometimes the stemming algorithm produces stems that we don't want.
For example, in a test run on text about business intelligence, the algorithm stemmed 'business' to 'busi' and 'intelligence' to 'intellig'.
However, the module's add_exceptions method lets you override stemming for specific words.
If 'business' and 'intelligence' are registered as exceptions that map to themselves, 'business' stays 'business' and 'intelligence' stays 'intelligence'.
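The exception mechanism can be sketched as follows. This is a minimal example, not the original script: the sample word list is an assumption made for illustration, and the locale is set explicitly to the module's English default.

```perl
use strict;
use warnings;
use Lingua::Stem;

# English stemmer; 'EN' is the module's default locale.
my $stemmer = Lingua::Stem->new(-locale => 'EN');

# Map each protected word to itself so it passes through unchanged.
$stemmer->add_exceptions({
    'business'     => 'business',
    'intelligence' => 'intelligence',
});

# Sample words are illustrative only.
my @words = qw(business intelligence clustering documents);

# stem() returns a reference to an array of stems, in input order.
my $stems = $stemmer->stem(@words);

print "$words[$_] -> $stems->[$_]\n" for 0 .. $#words;
```

Words that are not in the exception table are still stemmed by the regular algorithm, so the override stays limited to the terms you list.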
Removing Stopwords
Removing stopwords is another common step in text preparation. You can find stopword lists on various websites
or use the Lingua::StopWords Perl module. Stopwords are words like 'a' and 'the' that are used frequently but
carry little meaning by themselves.
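A short sketch of stopword removal with Lingua::StopWords is shown here. The sample sentence and the simple tokenization on non-word characters are assumptions chosen for illustration; a real script may need a more careful tokenizer.

```perl
use strict;
use warnings;
use Lingua::StopWords qw( getStopWords );

# getStopWords() returns a hash reference whose keys are the stopwords.
my $stopwords = getStopWords('en');

# Sample text is illustrative only.
my $text = 'The purpose of stemming is to reduce multiple forms of a word';

# Lowercase the text and split it on runs of non-word characters.
my @words = grep { length } split /\W+/, lc $text;

# Keep only the words that are not in the stopword hash.
my @kept = grep { !$stopwords->{$_} } @words;

print join(' ', @kept), "\n";
```

Because the list is an ordinary hash, lookups are cheap even for long documents.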
In some situations you may want to build your own stopword list. There is an article that describes an automatic way to do this.[4]
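One simple way to derive a custom list is sketched below. It is not the method from the cited article: it is a plain document-frequency heuristic, and the mini-corpus and the "appears in every document" threshold are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical mini-corpus; each string stands for one document.
my @docs = (
    'the stemming of the words in the text',
    'the clustering of the documents',
    'the keywords of a document',
);

# Count in how many documents each word occurs (document frequency).
my %doc_freq;
for my $doc (@docs) {
    my %seen;
    $seen{$_} = 1 for grep { length } split /\W+/, lc $doc;
    $doc_freq{$_}++ for keys %seen;
}

# Illustrative threshold: a word found in every document is a candidate.
my @candidates = grep { $doc_freq{$_} == @docs } keys %doc_freq;
@candidates = sort @candidates;

print "Candidate stopwords: @candidates\n";   # prints "Candidate stopwords: of the"
```

On a real corpus the threshold would normally be a fraction of the documents rather than all of them, and the candidates would be reviewed by hand before use.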
Splitting the Document into Sentences
Splitting the text into sentences can also be part of preparing a document for further text processing. Perl has the module Lingua::EN::Sentence for this task.[5] Once the text
is split into sentences, keywords can be extracted
from each sentence. The input is a text document stored in a variable.
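The sentence-by-sentence processing can be sketched like this. The variable name $text and its sample content are placeholders introduced for this example, since the original variable name is not given.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences );
use Lingua::EN::Keywords;   # exports keywords() by default

# $text is a placeholder name; the sample content is illustrative.
my $text = 'Business intelligence turns raw data into reports. '
         . 'Text mining extracts patterns from documents. '
         . 'Perl modules make both tasks easier.';

# get_sentences() returns a reference to an array of sentences
# (or undef for empty input, hence the fallback).
my $sentences = get_sentences($text) || [];

for my $sentence (@$sentences) {
    # Extract keywords from this sentence alone.
    my @keywords = keywords($sentence);
    print "Sentence: $sentence\n";
    print "Keywords: @keywords\n\n";
}
```

For very short sentences the keyword list may contain fewer entries than for a full document.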
Extracting Keywords from the Text Document
Keyword extraction, applied per sentence in the previous section, also works on a whole document. The module Lingua::EN::Keywords has a keywords function that returns the top five keywords characterizing the meaning of the document.
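Applied to a whole document, the call looks like the sketch below. The variable name $document and the sample text are assumptions for illustration.

```perl
use strict;
use warnings;
use Lingua::EN::Keywords;   # exports keywords() by default

# $document is a placeholder; the sample text is illustrative only.
my $document = join ' ',
    'Text mining prepares documents for clustering.',
    'Stemming reduces words to a common form.',
    'Stopword removal drops frequent words that carry little meaning.',
    'Keywords summarize what the documents are about.';

# keywords() returns a list of keywords in order of relevance.
my @keywords = keywords($document);

print "$_\n" for @keywords;
```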
Conclusion
Perl has more modules for various processing steps and for many different languages.
External links to some of the Perl modules used in the examples are provided below [5],[6].
The Perl script with all the examples above is provided below.[2] This script shows only very basic
pre-processing steps, but with Perl modules it can easily be modified for specific needs.