Algorithm-KMeans


examples/cleanup_directory.pl

#  There should be no need to call this script ordinarily.

#  When the Algorithm::KMeans module creates new cluster files,
#  it automatically deletes all previously created files of that kind.
#  Such files are named clusterX.dat for X starting with X = 0.
#  The files __temp_* are created by the visualization script.
#  However, when the program terminates properly, it should 
#  automatically delete those files.


unlink glob "cluster*.dat";

unlink glob "__temp_*";

lib/Algorithm/KMeans.pm

        push @data_tags, $record_name;
    }
    $self->{_original_data} = \%data_hash;
    $self->{_data_id_tags} = \@data_tags;
    $self->{_N} = scalar @data_tags;
    if ($self->{_var_normalize}) {
        $self->{_data} =  variance_normalization( $self->{_original_data} ); 
    } else {
        $self->{_data} = deep_copy_hash( $self->{_original_data} );
    }
    # Need to make the following call to set the global mean and covariance:
    # my $covariance =  $self->estimate_mean_and_covariance(\@data_tags);
    # Need to make the following call to set the global eigenvec eigenval sets:
    # $self->eigen_analysis_of_covariance($covariance);
    if ( defined($self->{_K}) && ($self->{_K} > 0) ) {
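        # Warn when there are too few data points for the requested number of
        # clusters.  ($! is not interpolated here: no system call has failed.)
        carp "\n\nWARNING: YOUR K VALUE IS TOO LARGE.\n The number of data " .
             "points must satisfy the relation N > 2*K**2, where K is " .
             "the number of clusters requested, for the clusters to be " .
             "meaningful"
                         if ( $self->{_N} < (2 * $self->{_K} ** 2) );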
        print "\n\n\n";
    }
}

lib/Algorithm/KMeans.pm

    if ($self->{_debug}) {
        print "\nDisplaying the Covariance Matrix for cluster:";
        display_matrix( $covariance );
    }
    return $covariance;
}

sub write_clusters_to_files {
    my $self = shift;
    my @clusters = @{$self->{_clusters}};
    # Remove cluster files left over from earlier runs.  The second pattern
    # matches the .txt names generated in the loop that follows.
    unlink glob "cluster*.dat";
    unlink glob "cluster*.txt";
    foreach my $i (0..$#clusters) {
        my $filename = "cluster" . $i . ".txt";
        print "\nWriting cluster $i to file $filename\n" if $self->{_terminal_output};
        # Pipe the output through sort(1) on its way into the file.
        open my $fh, '|-', "sort > $filename"  or die "Unable to open file: $!";
        foreach my $ele (@{$clusters[$i]}) {
            print $fh "$ele ";
        }
        close $fh;
    }
}

lib/Algorithm/KMeans.pm


The choice of QoC metric is a critical issue in its own right.  In the current
implementation, QoC is the ratio of the average radius of the clusters to the average
distance between the cluster centers, so smaller values indicate tighter,
better-separated clusters.
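
The ratio described above can be sketched as follows.  This is only an illustration,
not the module's own code: the helper names C<euclid>, C<centroid>, and C<qoc> are
hypothetical, and the "radius" of a cluster is assumed here to mean the average
distance of its members from the cluster mean, which may differ from the definition
the module actually uses.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Euclidean distance between two points given as array refs.
sub euclid {
    my ($p, $q) = @_;
    my $total = 0;
    $total += ($p->[$_] - $q->[$_]) ** 2 for 0 .. $#$p;
    return sqrt $total;
}

# Component-wise mean of a non-empty list of points.
sub centroid {
    my @points = @_;
    my @mean = (0) x @{ $points[0] };
    for my $pt (@points) {
        $mean[$_] += $pt->[$_] / @points for 0 .. $#mean;
    }
    return \@mean;
}

# Hypothetical QoC helper: average cluster radius divided by the average
# distance between cluster centers.  Expects at least two non-empty clusters,
# each passed as an array ref of points.
sub qoc {
    my @clusters = @_;
    my @centers  = map { centroid(@$_) } @clusters;

    # ASSUMPTION: radius = mean distance of the members from the cluster mean.
    my @radii;
    for my $i (0 .. $#clusters) {
        my @dists = map { euclid($_, $centers[$i]) } @{ $clusters[$i] };
        push @radii, sum(@dists) / @dists;
    }

    # Distances between every pair of cluster centers.
    my @gaps;
    for my $i (0 .. $#centers - 1) {
        for my $j ($i + 1 .. $#centers) {
            push @gaps, euclid($centers[$i], $centers[$j]);
        }
    }

    my $avg_radius = sum(@radii) / @radii;
    my $avg_gap    = sum(@gaps)  / @gaps;
    return $avg_radius / $avg_gap;
}
```

For two clusters of radius 1 whose centers are 10 units apart, this sketch yields a
QoC of 0.1; moving the centers closer together or spreading the members out raises
the value.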

Every iterative algorithm requires a stopping criterion.  The criterion implemented
here is that we stop iterations when there is no re-assignment of the data points
during the assignment step.
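
A minimal sketch of this stopping rule is shown below.  It is not taken from the
module: the function names C<sq_dist> and C<kmeans_until_stable> are hypothetical,
and leaving the center of an empty cluster unchanged is an assumption made to keep
the example short.

```perl
use strict;
use warnings;

# Squared Euclidean distance between two points given as array refs.
sub sq_dist {
    my ($p, $q) = @_;
    my $total = 0;
    $total += ($p->[$_] - $q->[$_]) ** 2 for 0 .. $#$p;
    return $total;
}

# Hypothetical helper: alternate assignment and update steps until an
# assignment step moves no data point to a different cluster.
sub kmeans_until_stable {
    my ($points, $initial_centers) = @_;
    my @centers = map { [@$_] } @$initial_centers;   # work on a copy
    my @label   = (-1) x @$points;                   # -1 means "unassigned"
    my $iterations = 0;

    while (1) {
        $iterations++;
        my $reassigned = 0;

        # Assignment step: attach each point to its nearest center.
        for my $n (0 .. $#$points) {
            my ($best, $best_d);
            for my $k (0 .. $#centers) {
                my $d = sq_dist($points->[$n], $centers[$k]);
                ($best, $best_d) = ($k, $d) if !defined $best_d || $d < $best_d;
            }
            if ($best != $label[$n]) {
                $label[$n] = $best;
                $reassigned++;
            }
        }

        # Stopping criterion: no point changed cluster in this pass.
        last unless $reassigned;

        # Update step: move each center to the mean of its members.
        for my $k (0 .. $#centers) {
            my @members = grep { $label[$_] == $k } 0 .. $#$points;
            next unless @members;   # ASSUMPTION: keep an empty cluster's center
            for my $d (0 .. $#{ $centers[$k] }) {
                my $s = 0;
                $s += $points->[$_][$d] for @members;
                $centers[$k][$d] = $s / @members;
            }
        }
    }
    return (\@label, \@centers, $iterations);
}
```

The function returns the final labels, the final centers, and the number of passes
made, so a caller can check how quickly the assignments stabilized.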

Ordinarily, the output produced by a K-Means clusterer will correspond to a local
minimum for the QoC values, as opposed to a global minimum.  The current
implementation guards against poor local minima when the module constructor is called
with the C<random> option for C<cluster_seeding>: it tries several randomly selected
sets of initial cluster centers and keeps the clustering that gives the best overall
QoC value.
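
The restart strategy can be sketched generically as below.  Everything in this
example is hypothetical: the function C<best_of_random_restarts>, its argument names,
and the two callbacks (C<cluster_fn>, which runs one clustering from the given seeds,
and C<qoc_fn>, which scores a result) are illustrative stand-ins, not part of the
module's interface.  The sketch assumes that a smaller QoC value is better and that
seeds are drawn from the data points themselves.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: run a clusterer from several random seedings and keep
# the result with the smallest QoC value.
sub best_of_random_restarts {
    my (%args) = @_;
    my ($points, $K, $trials)  = @args{qw(points K trials)};
    my ($cluster_fn, $qoc_fn)  = @args{qw(cluster_fn qoc_fn)};

    my ($best_result, $best_qoc);
    for my $trial (1 .. $trials) {
        # ASSUMPTION: pick K distinct data points as the initial centers.
        my @seeds  = (shuffle @$points)[0 .. $K - 1];
        my $result = $cluster_fn->($points, \@seeds);
        my $qoc    = $qoc_fn->($result);

        # Keep this trial only if it beats the best one seen so far.
        ($best_result, $best_qoc) = ($result, $qoc)
            if !defined $best_qoc || $qoc < $best_qoc;
    }
    return ($best_result, $best_qoc);
}
```

Because the clusterer and the scoring function are passed in as code references, the
same loop works with any K-Means routine and any QoC definition for which lower
values are better.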

A K-Means clusterer will generally produce good results if the overlap between the
clusters is minimal and if each cluster exhibits variability that is uniform in all
directions.  When the data variability is different along the different directions in
the data space, the results you obtain with a K-Means clusterer may be improved by
first normalizing the data appropriately, as can be done in this module when you set
the C<do_variance_normalization> option in the constructor.  However, as pointed out


