Applied Math & Computer Science Lab
Data Analysis, Optimization & Mathematical Modeling, Artificial Intelligence, Neural Net For Everyday Life Applications

The Pre-processing Steps for Text Document Mining with Perl

Introduction

In previous work [1],[3], a Perl script for document clustering was created, but no pre-processing steps were considered. This work investigates some pre-processing steps that can easily be added to the text mining process.

Stemming

The purpose of stemming is to reduce multiple forms of a word to a common form. Perl has the module Lingua::Stem for this. Sometimes the algorithm produces results that we don't want. For example, in a test run on a text about business intelligence, it stemmed 'business' to 'busi' and 'intelligence' to 'intellig'. However, the module's add_exceptions method lets you override the stem for specific words.
If 'business' and 'intelligence' are registered as exceptions that map to themselves, 'business' stays 'business' and 'intelligence' stays 'intelligence'.
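
The exception mechanism described above can be sketched as follows. This is a minimal illustration rather than the original script: the word list and the variable names ($plain, $custom) are assumptions made for the example, and it assumes Lingua::Stem is installed from CPAN.

```perl
use strict;
use warnings;
use Lingua::Stem;

# Sample words; the list itself is only an assumption for illustration.
my @words = qw(business intelligence);

# Stemmer with default behaviour for English.
my $plain       = Lingua::Stem->new(-locale => 'EN');
my $plain_stems = $plain->stem(@words);      # returns an array reference

# Second stemmer that maps the two words to themselves,
# so the stemming algorithm leaves them untouched.
my $custom = Lingua::Stem->new(-locale => 'EN');
$custom->add_exceptions({
    business     => 'business',
    intelligence => 'intelligence',
});
my $custom_stems = $custom->stem(@words);

print "default stems:   @$plain_stems\n";
print "with exceptions: @$custom_stems\n";
```

Two separate stemmer objects are used here only to show both results side by side; in a real script a single stemmer with the exceptions registered would normally be enough.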

Removing Stopwords

Removing stopwords is another common step in text preparation. You can find stopword lists on various websites or use the Lingua::StopWords Perl module. Stopwords are words such as 'a' and 'the' that are used frequently but carry little meaning by themselves.
In some situations you may want to build your own stopword list. There is an article that describes an automatic way to do this [4].
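
As a rough sketch of stopword removal with Lingua::StopWords, the snippet below filters a sample sentence against the module's English list. The sample text and the simple split on non-word characters are assumptions chosen for illustration; a real script may tokenize differently.

```perl
use strict;
use warnings;
use Lingua::StopWords qw(getStopWords);

# getStopWords returns a hash reference whose keys are the stopwords.
my $stopwords = getStopWords('en');

# Sample input; any text could be used here.
my $text = 'The purpose of stemming is to reduce multiple forms of a word to a common form';

# Lowercase the text, split it on non-word characters, drop empty tokens.
my @words = grep { length } split /\W+/, lc $text;

# Keep only the words that are not in the stopword list.
my @kept = grep { !$stopwords->{$_} } @words;

print "@kept\n";
```

Because the list is an ordinary hash, your own stopwords can be added by setting further keys, for example $stopwords->{'etc'} = 1.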

Splitting the Document into Sentences

Splitting the text into sentences can also be part of document preparation for further text processing. Perl has the module Lingua::EN::Sentence for this task [5]. The text document can be split into sentences, and keywords can then be extracted from each sentence. The input is a text document stored in a variable.
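
A minimal sketch of this step is shown below. It is not the original example: the variable name $text and the sample content are assumptions, and both Lingua::EN::Sentence and Lingua::EN::Keywords are assumed to be installed. Very short sentences may yield few or no keywords.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw(get_sentences);
use Lingua::EN::Keywords;                  # exports keywords() by default

# Sample document; $text is a hypothetical name for the input variable.
my $text = 'Text mining starts with clean input. '
         . 'Stemming reduces words to a common form. '
         . 'Stopwords are removed before clustering.';

# get_sentences returns a reference to an array of sentences.
my $sentences = get_sentences($text);

foreach my $sentence (@$sentences) {
    # Extract keywords for this sentence; skip undefined entries
    # in case a short sentence produces fewer keywords.
    my @keywords = grep { defined } keywords($sentence);
    print "Sentence: $sentence\n";
    print "Keywords: @keywords\n";
}
```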

Extracting Keywords from the Text Document

Keyword extraction was already mentioned in the previous section. The module Lingua::EN::Keywords provides a keywords function, which returns the five top keywords that characterize the meaning of the document.
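
Applied to a whole document instead of a single sentence, the call might look like the sketch below. The sample text and the variable name $text are assumptions for illustration; in practice the text would usually be read from a file.

```perl
use strict;
use warnings;
use Lingua::EN::Keywords;                  # exports keywords() by default

# Sample document text; replace with the content of a real document.
my $text = 'Document clustering groups similar documents together. '
         . 'Before clustering, each document is usually cleaned. '
         . 'Stemming and stopword removal reduce the vocabulary of a document. '
         . 'Keywords then summarize what each document is about.';

# keywords() returns a list of keywords in order of relevance;
# undefined entries are dropped in case the text is too short.
my @keywords = grep { defined } keywords($text);

print join(', ', @keywords), "\n";
```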

Conclusion

Perl has many more modules for various processing steps and for many different languages. External links to some of the Perl modules used in the examples are provided below [5],[6]. A Perl script with all of the examples above is also available [2]. This script shows only very basic pre-processing steps, but with Perl modules it can easily be modified for specific needs.

References


1. Document clustering, source code
2. The pre-processing steps for text document mining with Perl (examples source code)
3. Document Clustering with Perl Script

External Links


4. Automatically Building a Stopword List for an Information Retrieval System. Rachel Tsz-Wai Lo, Ben He, Iadh Ounis. Department of Computing Science, University of Glasgow
5. Module for splitting text into sentences
6. Lingua::EN::Keywords