Applied Math & Computer Science Lab
Data Analysis, Optimization & Mathematical Modeling, Artificial Intelligence, Neural Net For Everyday Life Applications

Document Clustering with Perl Script

Document clustering is a common task in text mining. Its goal is to divide documents into groups of similar documents. The Perl script described here clusters the text files in a specified folder.
The program scans this folder and collects the file names. It then extracts every word from each file and builds a two-dimensional data array, data[document index][word index]. A second array, docs[filename index], keeps track of the file names.
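The array-building step could be sketched roughly as follows. This is only an illustration, not the script itself: the helper name build_matrix is hypothetical, and I assume each cell of the data array holds the number of times a word occurs in a document, which the description above does not spell out. Words are lowercased and split on non-word characters, which is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): scans $dir and returns
# references to docs[filename index], a word list, and
# data[document index][word index]. Cells are assumed to be word counts.
sub build_matrix {
    my ($dir) = @_;

    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @files = grep { -f "$dir/$_" } readdir($dh);   # plain files only
    closedir($dh);
    my @docs = sort @files;

    my (%word_index, @words, @counts);
    for my $d (0 .. $#docs) {
        $counts[$d] = {};
        open(my $fh, '<', "$dir/$docs[$d]") or die "Cannot read $docs[$d]: $!";
        while (my $line = <$fh>) {
            # Lowercase the line and take every run of word characters.
            for my $w (lc($line) =~ /(\w+)/g) {
                if (!exists $word_index{$w}) {
                    $word_index{$w} = scalar @words;  # next free column
                    push @words, $w;
                }
                $counts[$d]{$w}++;
            }
        }
        close($fh);
    }

    # Expand the per-document counts into a dense matrix with one column
    # per distinct word, in the order the words were first seen.
    my @data;
    for my $d (0 .. $#docs) {
        $data[$d] = [ map { $counts[$d]{$_} // 0 } @words ];
    }
    return (\@docs, \@words, \@data);
}
```

Calling `my ($docs, $words, $data) = build_matrix('some_folder');` would then give one row per file, so `$data->[0][2]` is the count of the third word in the first file. A dense matrix is simple to hand to a clustering routine, though it can get large when the folder contains many distinct words.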
The data array is then used for clustering. The script uses the k-means clustering algorithm; links to the source code and to descriptions of this and some other algorithms are provided below. K-means requires the number of clusters to be set in advance. As an experiment, I put 20 text files in the folder, some of them Perl scripts and the rest plain text files, and set the number of clusters to 3. One of the resulting clusters contained all the Perl scripts plus 3 other text documents.
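For readers who want to see the idea in code, here is a minimal k-means sketch in Perl. It is not the implementation linked in the references: the function names kmeans and sq_dist are hypothetical, the first k rows are used as the starting centroids so that runs are reproducible (a random start is more common), and squared Euclidean distance is assumed as the similarity measure.

```perl
use strict;
use warnings;

# Squared Euclidean distance between two equal-length vectors.
sub sq_dist {
    my ($p, $q) = @_;
    my $sum = 0;
    $sum += ($p->[$_] - $q->[$_]) ** 2 for 0 .. $#$p;
    return $sum;
}

# Hypothetical kmeans(\@data, $k, $max_iter): returns a reference to an
# array with one cluster index per row of @data.
# Assumption: the first $k rows seed the centroids.
sub kmeans {
    my ($data, $k, $max_iter) = @_;
    $max_iter //= 100;
    die "need at least $k rows" if @$data < $k;

    my @centroids = map { [ @{ $data->[$_] } ] } 0 .. $k - 1;
    my @assign    = (-1) x @$data;
    my $dim       = scalar @{ $data->[0] };

    for my $iter (1 .. $max_iter) {
        my $changed = 0;

        # Assignment step: attach each row to its nearest centroid.
        for my $i (0 .. $#$data) {
            my ($best, $best_d) = (0, sq_dist($data->[$i], $centroids[0]));
            for my $c (1 .. $k - 1) {
                my $d = sq_dist($data->[$i], $centroids[$c]);
                ($best, $best_d) = ($c, $d) if $d < $best_d;
            }
            if ($assign[$i] != $best) {
                $assign[$i] = $best;
                $changed = 1;
            }
        }
        last unless $changed;    # converged: no row switched cluster

        # Update step: move each centroid to the mean of its members.
        for my $c (0 .. $k - 1) {
            my @members = grep { $assign[$_] == $c } 0 .. $#$data;
            next unless @members;    # keep the old centroid if a cluster is empty
            $centroids[$c] = [
                map {
                    my $j = $_;
                    my $s = 0;
                    $s += $data->[$_][$j] for @members;
                    $s / scalar(@members);
                } 0 .. $dim - 1
            ];
        }
    }
    return \@assign;
}
```

With the data and docs arrays filled as described earlier, `my $assign = kmeans(\@data, 3);` followed by `print "$docs[$_] => cluster $assign->[$_]\n" for 0 .. $#docs;` would list which cluster each file landed in.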
Thus we implemented document clustering. We used the k-means algorithm, but we also have other implemented algorithms that can easily be plugged into this script. We skipped the data preprocessing step. Steps such as tokenization, eliminating stop words, and handling multiple forms of words with the same root can significantly improve the clustering result. They will be considered in the next script.
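To give a rough idea of what such preprocessing might look like, here is a small sketch. Everything in it is an assumption for illustration: the helper name preprocess is hypothetical, the stop-word list is deliberately tiny, and the suffix stripping is a crude stand-in for real stemming rather than a proper stemmer.

```perl
use strict;
use warnings;

# A tiny illustrative stop-word list; a real list would be much longer.
my %STOP = map { $_ => 1 } qw(a an and are in is of or the to);

# Hypothetical helper: lowercase the text, split it into alphabetic tokens,
# drop stop words, and strip a few common English suffixes from longer words.
sub preprocess {
    my ($text) = @_;
    my @tokens;
    for my $tok (lc($text) =~ /([a-z]+)/g) {
        next if $STOP{$tok};
        my $w = $tok;
        # Naive suffix stripping, only for words longer than 4 letters.
        $w =~ s/(?:ing|ed|s)$// if length($w) > 4;
        push @tokens, $w;
    }
    return @tokens;
}
```

Counting the tokens returned by this helper instead of the raw words would merge forms such as "clustering" and "clustered" into one column and keep very frequent words out of the data array.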

References

1. Document clustering, source code
2. k-means algorithm (source code)
3. K-means clustering
4. Hierarchical clustering