Document clustering is one of the common tasks in text mining. The goal of document clustering is to
divide documents into groups of similar documents. The Perl script implemented here clusters the text files
in a specified folder.
The program scans this folder and collects the file names. Then, for each file, it extracts each word and builds
a two-dimensional data array: data[document index][word index]. Another array, docs[filename index], keeps track of
the file names.
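As a rough illustration of this data layout, here is a minimal sketch, not the script itself. The sub names read_folder and build_matrix, the sample file names, and the choice of word counts as the values stored in the array are all assumptions made for the example.

```perl
use strict;
use warnings;

# Read every plain file in $dir into a hash: file name => file contents.
sub read_folder {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = sort grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;
    my %texts;
    for my $name (@names) {
        open(my $fh, '<', "$dir/$name") or die "Cannot read $name: $!";
        local $/;                       # slurp mode: read the whole file at once
        $texts{$name} = <$fh>;
        close $fh;
    }
    return \%texts;
}

# Turn the texts into word counts: $data->[document index][word index].
# Assumption: each cell holds how often the word occurs in the document.
sub build_matrix {
    my ($texts) = @_;
    my @docs = sort keys %$texts;       # docs[filename index]
    my (@words, %word_index, @data);
    for my $d (0 .. $#docs) {
        for my $word (lc($texts->{ $docs[$d] }) =~ /\w+/g) {
            if (!exists $word_index{$word}) {
                push @words, $word;     # first time we see this word
                $word_index{$word} = $#words;
            }
            $data[$d][ $word_index{$word} ]++;
        }
    }
    # Words absent from a document get an explicit zero count.
    for my $d (0 .. $#docs) {
        $data[$d][$_] //= 0 for 0 .. $#words;
    }
    return (\@docs, \@words, \@data);
}

# Small in-memory example (hypothetical file names and contents).
my ($docs, $words, $data) = build_matrix({
    'a.txt' => 'use strict use warnings',
    'b.txt' => 'the cat sat',
});
print "$docs->[$_]: @{ $data->[$_] }\n" for 0 .. $#$docs;
```

To run it on a real folder, pass the result of read_folder to build_matrix, for example build_matrix(read_folder('some_folder')), where the folder name is a placeholder.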
After this the data array is used for clustering. This script uses the k-means clustering algorithm.
Links to the source code and to descriptions of this and some other algorithms are provided below.
The clustering algorithm requires the number of clusters to be set in advance. For the experiment
with this script I put 20 text files in the folder; some of them are Perl script files, and some are just text
files. I set the number of clusters to 3. One resulting cluster contains all the Perl scripts plus 3 text documents.
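For readers who want to see the idea in code, here is a minimal k-means sketch. It is not the linked implementation. The sub names dist2 and kmeans, the toy data, and the choice of the first k rows as starting centroids are assumptions made to keep the example short and repeatable; real implementations usually pick the starting centroids at random.

```perl
use strict;
use warnings;

# Squared Euclidean distance between two equal-length vectors.
sub dist2 {
    my ($p, $q) = @_;
    my $sum = 0;
    $sum += ($p->[$_] - $q->[$_]) ** 2 for 0 .. $#$p;
    return $sum;
}

# Minimal k-means: returns one cluster label per row of $data.
# Assumption: the first $k rows serve as the initial centroids.
sub kmeans {
    my ($data, $k, $max_iter) = @_;
    my @centroids = map { [ @{ $data->[$_] } ] } 0 .. $k - 1;
    my @labels = (-1) x @$data;
    for (1 .. $max_iter) {
        my $changed = 0;
        # Assignment step: attach each row to its nearest centroid.
        for my $i (0 .. $#$data) {
            my ($best, $best_d);
            for my $c (0 .. $k - 1) {
                my $d = dist2($data->[$i], $centroids[$c]);
                ($best, $best_d) = ($c, $d) if !defined $best_d || $d < $best_d;
            }
            $changed++ if $labels[$i] != $best;
            $labels[$i] = $best;
        }
        last unless $changed;           # no label moved: converged
        # Update step: move each centroid to the mean of its members.
        for my $c (0 .. $k - 1) {
            my @members = grep { $labels[$_] == $c } 0 .. $#$data;
            next unless @members;       # leave an empty cluster's centroid in place
            for my $j (0 .. $#{ $data->[0] }) {
                my $sum = 0;
                $sum += $data->[$_][$j] for @members;
                $centroids[$c][$j] = $sum / @members;
            }
        }
    }
    return @labels;
}

# Toy word-count rows (made-up numbers): docs 0 and 2 share words, as do 1 and 3.
my @data = (
    [2, 1, 0, 0],
    [0, 0, 1, 2],
    [3, 1, 0, 0],
    [0, 0, 2, 2],
);
my @labels = kmeans(\@data, 2, 100);
print "document $_ -> cluster $labels[$_]\n" for 0 .. $#labels;
```

With this toy data, documents 0 and 2 end up in one cluster and documents 1 and 3 in the other. On real data the result depends on the starting centroids, so several runs may give different groupings.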
Thus we implemented document clustering. We used the k-means clustering algorithm, but we also have some other
implemented algorithms that can easily be plugged into this script. We skipped the data preprocessing step. Actions
such as eliminating stop words, tokenization, and handling multiple forms of words with the same root can
significantly improve the clustering result. This will be considered in the next script.
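To give a feel for what such preprocessing might look like, here is a small sketch under stated assumptions. The stop-word list is a tiny made-up sample, and crude_stem is a naive suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```perl
use strict;
use warnings;

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
my %stop = map { $_ => 1 } qw(a an and are in is of the to);

# Naive suffix stripping for longer words. NOT a real stemmer.
sub crude_stem {
    my ($word) = @_;
    $word =~ s/(?:ing|ed|s)$// if length($word) > 4;
    return $word;
}

# Tokenize, drop stop words, then reduce each remaining word to a rough root.
sub preprocess {
    my ($text) = @_;
    my @tokens = lc($text) =~ /[a-z]+/g;    # tokenization: runs of letters
    return map { crude_stem($_) } grep { !$stop{$_} } @tokens;
}

my @terms = preprocess('The documents are clustered in the clustering script');
print "@terms\n";
```

Here "clustered" and "clustering" both collapse to "cluster", so they would count as the same word when the data array is built.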