Algorithm-LinearManifoldDataClusterer


examples/example1.pl

#use lib '../blib/lib', '../blib/arch';


##  example1.pl

##  Highlights:
##
##    ---  The data file contains 498 samples in three small clusters 
##         on the surface of a sphere
##
##    ---  Note the use of 0.001 for delta_reconstruction_error

use strict;
use Algorithm::LinearManifoldDataClusterer;


my $datafile = "3_clusters_on_a_sphere_498_samples.csv";

my $mask = "N111"; 

my $clusterer = Algorithm::LinearManifoldDataClusterer->new( 
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     # number of clusters
                                    P        => 2,     # manifold dimensionality
                                    max_iterations => 15,
                                    cluster_search_multiplier => 2,
                                    delta_reconstruction_error => 0.001,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                );

$clusterer->get_data_from_csv();

my $clusters = $clusterer->linear_manifold_clusterer();

$clusterer->display_reconstruction_errors_as_a_function_of_iterations();

$clusterer->write_clusters_to_files($clusters);

$clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

# Now make a png image file that shows the final clusters:
$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

examples/example2.pl


#use lib '../blib/lib', '../blib/arch';

##  example2.pl

##  Highlights:
##
##    ---  The data file contains 3000 samples in three large
##         clusters on the surface of a sphere
##
##    ---  Note the use of 0.012 for delta_reconstruction_error


use strict;
use Algorithm::LinearManifoldDataClusterer;

my $datafile = "3_clusters_on_a_sphere_3000_samples.csv";

my $mask = "N111";         # tag field followed by three numeric data fields

my $clusterer = Algorithm::LinearManifoldDataClusterer->new( 
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     # number of clusters
                                    P        => 2,     # manifold dimensionality
                                    max_iterations => 15,
                                    cluster_search_multiplier => 2,
                                    delta_reconstruction_error => 0.012,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 0,
                                    make_png_for_each_iteration => 1,
                );

$clusterer->get_data_from_csv();

my $clusters = $clusterer->linear_manifold_clusterer();

$clusterer->display_reconstruction_errors_as_a_function_of_iterations();

$clusterer->write_clusters_to_files($clusters);

$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters);

# Now make a png image file that shows the final clusters:
$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

examples/example3.pl


#use lib '../blib/lib', '../blib/arch';

##  example3.pl

##  Highlights:
##
##    --- The data file contains 1000 samples in four small
##        clusters on the surface of a sphere
##
##    ---  Note the use of 0.002 for delta_reconstruction_error


use strict;
use Algorithm::LinearManifoldDataClusterer;

my $datafile = "4_clusters_on_a_sphere_1000_samples.csv";

my $mask = "N111";         # tag field followed by three numeric data fields

my $clusterer = Algorithm::LinearManifoldDataClusterer->new( 
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 4,     # number of clusters
                                    P        => 2,     # manifold dimensionality
                                    cluster_search_multiplier => 2,
                                    max_iterations => 15,
                                    delta_reconstruction_error => 0.002,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                );

$clusterer->get_data_from_csv();

my $clusters = $clusterer->linear_manifold_clusterer();

$clusterer->display_reconstruction_errors_as_a_function_of_iterations();

$clusterer->write_clusters_to_files($clusters);

$clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

# Now make a png image file that shows the final clusters:
$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

examples/example4.pl

##  Highlights:
##
##    ---  The main highlight here is the use of the auto_retry_clusterer()
##         method for automatically invoking the clusterer repeatedly 
##         should it fail on account of the Fail-First bias built into
##         the code.
##
##    ---  The data file contains 498 samples in three small clusters 
##         on the surface of a sphere
##
##    ---  Note the use of 0.001 for delta_reconstruction_error

use strict;
use Algorithm::LinearManifoldDataClusterer;


my $datafile = "3_clusters_on_a_sphere_498_samples.csv";

my $mask = "N111"; 

my $clusterer = Algorithm::LinearManifoldDataClusterer->new( 
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     # number of clusters
                                    P        => 2,     # manifold dimensionality
                                    max_iterations => 15,
                                    cluster_search_multiplier => 1,
                                    delta_reconstruction_error => 0.001,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                );

$clusterer->get_data_from_csv();

my $clusters = $clusterer->auto_retry_clusterer();

$clusterer->display_reconstruction_errors_as_a_function_of_iterations();

$clusterer->write_clusters_to_files($clusters);

$clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

# Now make a png image file that shows the final clusters:
$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

lib/Algorithm/LinearManifoldDataClusterer.pm

    croak "\nYou have used a wrong name for a keyword argument " .
          "--- perhaps a misspelling\n" 
          if check_for_illegal_params(@params) == 0;
    bless {
        _datafile                     =>   $args{datafile} || croak("datafile required"),
        _mask                         =>   $args{mask}     || croak("mask required"),
        _K                            =>   $args{K}        || 0,
        _P                            =>   $args{P}        || 0,
        _terminal_output              =>   $args{terminal_output} || 0,
        _max_iterations               =>   $args{max_iterations} || 0,
        _delta_reconstruction_error   =>   $args{delta_reconstruction_error} || 0.001,
        _delta_normalized_error       =>   undef,
        _cluster_search_multiplier    =>   $args{cluster_search_multiplier} || 1,
        _visualize_each_iteration     =>   $args{visualize_each_iteration} ? 1 : 0,
        _show_hidden_in_3D_plots      =>   $args{show_hidden_in_3D_plots} ? 1 : 0,
        _make_png_for_each_iteration  =>   $args{make_png_for_each_iteration} ? 1 : 0,
        _debug                        =>   $args{debug} || 0,
        _N                            =>   0,
        _KM                           =>   ($args{K} || 0) * ($args{cluster_search_multiplier} || 1),
        _data_hash                    =>   {},
        _data_tags                    =>   [],
        _data_dimensions              =>   0,
        _final_clusters               =>   [],
        _auto_retry_flag              =>   0,
        _num_iterations_actually_used =>   undef,
        _scale_factor                 =>   undef,
        _data_tags_to_cluster_label_hash  => {},
        _final_reference_vecs_for_all_subspaces => [],
        _reconstruction_error_as_a_function_of_iteration => [],
        _final_trailing_eigenvec_matrices_for_all_subspaces => [],
        _subspace_construction_error_as_a_function_of_iteration => [],
    }, $class;
}

sub get_data_from_csv {
    my $self = shift;
    my $filename = $self->{_datafile} || die "you did not specify a file with the data to be clustered";
    my $mask = $self->{_mask};
    my @mask = split //, $mask;
    $self->{_data_dimensions} = scalar grep {$_ eq '1'} @mask;
    print "data dimensionality:  $self->{_data_dimensions} \n" if $self->{_terminal_output};

lib/Algorithm/LinearManifoldDataClusterer.pm

        display_mean_and_covariance($mean, $covariance) if $self->{_debug};
    }
    my @clusters = @$initial_clusters;
    display_clusters(\@clusters) if $self->{_debug};
    my $iteration_index = 0;
    my $unimodal_correction_flag;
    my $previous_min_value_for_unimodality_quotient;
    while ($iteration_index < $self->{_max_iterations}) {
        print "\n\n========================== STARTING ITERATION $iteration_index =====================\n\n"
            if $self->{_terminal_output};
        my $total_reconstruction_error_this_iteration = 0;
        my @subspace_construction_errors_this_iteration;
        my @trailing_eigenvec_matrices_for_all_subspaces;
        my @reference_vecs_for_all_subspaces;
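        # For each non-empty cluster, do a local PCA: estimate the cluster's mean
        # and covariance, eigendecompose the covariance, and retain the trailing
        # eigenvectors (those beyond the first P) for measuring the
        # reconstruction error.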
        foreach my $cluster (@clusters) {
            next if @$cluster == 0;
            my ($mean, $covariance) = $self->estimate_mean_and_covariance($cluster);
            display_mean_and_covariance($mean, $covariance) if $self->{_debug};
            print "--------------end of displaying mean and covariance\n\n" if $self->{_debug};
            my ($eigenvecs, $eigenvals) = $self->eigen_analysis_of_covariance($covariance);
            my @trailing_eigenvecs =  @{$eigenvecs}[$self->{_P} .. $self->{_data_dimensions}-1];
            my @trailing_eigenvals =  @{$eigenvals}[$self->{_P} .. $self->{_data_dimensions}-1];
            my $subspace_construction_error = reduce {abs($a) + abs($b)} @trailing_eigenvals;
            push @subspace_construction_errors_this_iteration, $subspace_construction_error;
            my $trailing_eigenvec_matrix = Math::GSL::Matrix->new($self->{_data_dimensions}, 
                                                                 scalar(@trailing_eigenvecs));
            foreach my $j (0..@trailing_eigenvecs-1) {
                print "trailing eigenvec column: @{$trailing_eigenvecs[$j]}\n" if $self->{_debug};
                $trailing_eigenvec_matrix->set_col($j, $trailing_eigenvecs[$j]);
            }
            push @trailing_eigenvec_matrices_for_all_subspaces,$trailing_eigenvec_matrix;
            push @reference_vecs_for_all_subspaces, $mean;
        }
        $self->set_termination_reconstruction_error_threshold(\@reference_vecs_for_all_subspaces);
        my %best_subspace_based_partition_of_data;
        foreach my $i (0..$self->{_KM}-1) {
            $best_subspace_based_partition_of_data{$i} = [];
        }
        foreach my $data_tag (@{$self->{_data_tags}}) {
            my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
            $data_vec->set_col(0, $self->{_data_hash}->{$data_tag});
            my @errors = map {$self->reconstruction_error($data_vec,
                                  $trailing_eigenvec_matrices_for_all_subspaces[$_],
                                  $reference_vecs_for_all_subspaces[$_])}
                         0 .. $self->{_KM}-1;
            my ($minval, $index_for_closest_subspace) = minimum(\@errors);
            $total_reconstruction_error_this_iteration += $minval;
            push @{$best_subspace_based_partition_of_data{$index_for_closest_subspace}},
                                                                      $data_tag;
        }
        print "Finished calculating the eigenvectors for the clusters produced by the previous\n" .
              "iteration and re-assigning the data samples to the new subspaces on the basis of\n".
              "the least reconstruction error.\n\n" .
              "Total reconstruction error in this iteration: $total_reconstruction_error_this_iteration\n"
                  if $self->{_terminal_output};
        foreach my $i (0..$self->{_KM}-1) {
            $clusters[$i] = $best_subspace_based_partition_of_data{$i};
        }
        display_clusters(\@clusters) if $self->{_terminal_output};
        # Check if any cluster has lost all its elements. If so, fragment the worst
        # existing cluster to create the additional clusters needed:
        if (any {@$_ == 0} @clusters) {
            die "empty cluster found" if $self->{_auto_retry_flag};
            print "\nOne or more clusters have become empty.  Will carve out the needed clusters\n" .
                  "from the cluster with the largest subspace construction error.\n\n";
            $total_reconstruction_error_this_iteration = 0;
            my $how_many_extra_clusters_needed = $self->{_KM} - scalar(grep {@$_ != 0} @clusters);
            print "number of extra clusters needed at iteration $iteration_index: $how_many_extra_clusters_needed\n";
            # locate the cluster with the largest subspace construction error
            # before resetting the per-iteration error list:
            my $max = List::Util::max @subspace_construction_errors_this_iteration;
            my ($maxindex) = grep {$subspace_construction_errors_this_iteration[$_] == $max}
                             0 .. $#subspace_construction_errors_this_iteration;
            @subspace_construction_errors_this_iteration = ();
            my @cluster_fragments = cluster_split($clusters[$maxindex], 
                                                  $how_many_extra_clusters_needed + 1);
            my @newclusters;
            push @newclusters, @clusters[0 .. $maxindex-1];
            push @newclusters, @clusters[$maxindex+1 .. $self->{_KM}-1];
            push @newclusters, @cluster_fragments;
            @newclusters = grep {@$_ != 0} @newclusters; 
            die "something went wrong with cluster fragmentation" 
                unless $self->{_KM} = @newclusters;
            @trailing_eigenvec_matrices_for_all_subspaces = ();
            @reference_vecs_for_all_subspaces = ();
            foreach my $cluster (@newclusters) {
                die "Linear Manifold Clustering did not work $!" if @$cluster == 0;
                my ($mean, $covariance) = $self->estimate_mean_and_covariance($cluster);
                my ($eigenvecs, $eigenvals) = $self->eigen_analysis_of_covariance($covariance);
                my @trailing_eigenvecs =  @{$eigenvecs}[$self->{_P} .. $self->{_data_dimensions}-1];
                my @trailing_eigenvals =  @{$eigenvals}[$self->{_P} .. $self->{_data_dimensions}-1];
                my $subspace_construction_error = reduce {abs($a) + abs($b)} @trailing_eigenvals;
                push @subspace_construction_errors_this_iteration, $subspace_construction_error;
                my $trailing_eigenvec_matrix = Math::GSL::Matrix->new($self->{_data_dimensions}, 
                                                      scalar(@trailing_eigenvecs));
                foreach my $j (0..@trailing_eigenvecs-1) {
                    $trailing_eigenvec_matrix->set_col($j, $trailing_eigenvecs[$j]);
                }
                push @trailing_eigenvec_matrices_for_all_subspaces,$trailing_eigenvec_matrix;
                push @reference_vecs_for_all_subspaces, $mean;
            }
            my %best_subspace_based_partition_of_data;
            foreach my $i (0..$self->{_KM}-1) {
                $best_subspace_based_partition_of_data{$i} = [];
            }
            foreach my $data_tag (@{$self->{_data_tags}}) {
                my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
                $data_vec->set_col(0, $self->{_data_hash}->{$data_tag});
                my @errors = map {$self->reconstruction_error($data_vec,
                                      $trailing_eigenvec_matrices_for_all_subspaces[$_],
                                      $reference_vecs_for_all_subspaces[$_])}
                             0 .. $self->{_KM}-1;
                my ($minval, $index_for_closest_subspace) = minimum(\@errors);
                $total_reconstruction_error_this_iteration += $minval;
                push @{$best_subspace_based_partition_of_data{$index_for_closest_subspace}},
                                                                      $data_tag;
            }
            print "empty-cluster jag: total reconstruction error in this iteration: \n" .
                  "$total_reconstruction_error_this_iteration\n"
                if $self->{_debug};
            foreach my $i (0..$self->{_KM}-1) {
                $clusters[$i] = $best_subspace_based_partition_of_data{$i};
            }
            display_clusters(\@clusters) if $self->{_terminal_output};
            @clusters = grep {@$_ != 0} @clusters;
            die "linear manifold based algorithm does not appear to work in this case $!" 
                unless @clusters == $self->{_KM};
        } # end of the if clause that handles empty clusters
        if ($self->{_data_dimensions} == 3) {

lib/Algorithm/LinearManifoldDataClusterer.pm

            fisher_yates_shuffle($worst_cluster);
            my @first_half = @$worst_cluster[0 .. int(scalar(@$worst_cluster)/2) - 1];
            my @second_half = @$worst_cluster[int(scalar(@$worst_cluster)/2) .. @$worst_cluster - 1];
            push @clusters, \@first_half;
            push @clusters, \@second_half;
            if ($self->{_terminal_output}) {
                print "\n\nShowing the clusters obtained after applying the unimodality correction:\n";
                display_clusters(\@clusters);      
            }
        }
        if (@{$self->{_reconstruction_error_as_a_function_of_iteration}} > 0) {
            my $last_recon_error = $self->{_reconstruction_error_as_a_function_of_iteration}[-1];
            if (($last_recon_error - $total_reconstruction_error_this_iteration) 
                                                        < $self->{_delta_normalized_error}) {
                push @{$self->{_reconstruction_error_as_a_function_of_iteration}}, 
                                                      $total_reconstruction_error_this_iteration;
                last;
            }
        }
        push @{$self->{_reconstruction_error_as_a_function_of_iteration}}, 
                                              $total_reconstruction_error_this_iteration;
        $iteration_index++;
        $previous_min_value_for_unimodality_quotient =  $min_value_for_unimodality_quotient;
    } # end of while loop on iteration_index
    $self->{_num_iterations_actually_used} = 
                                 scalar @{$self->{_reconstruction_error_as_a_function_of_iteration}};
    if ($self->{_terminal_output}) {
        print "\nIterations of the main loop terminated at iteration number $iteration_index.\n";
        print "Will now invoke graph partitioning to discover dominant clusters and to\n" .
              "merge small clusters.\n\n" if $self->{_cluster_search_multiplier} > 1;
        print "Total reconstruction error as a function of iterations: " .
                                       "@{$self->{_reconstruction_error_as_a_function_of_iteration}}";
    }
    # now merge sub-clusters if cluster_search_multiplier > 1
    my @final_clusters;
    if ($self->{_cluster_search_multiplier} > 1) {
        print "\n\nInvoking recursive graph partitioning to merge small clusters\n\n";
        my @array_of_partitioned_cluster_groups = (\@clusters);
        my @partitioned_cluster_groups;
        my $how_many_clusters_looking_for = $self->{_K};
        while (scalar(@final_clusters) < $self->{_K}) {
            @partitioned_cluster_groups = 

lib/Algorithm/LinearManifoldDataClusterer.pm

            @final_clusters = @new_final_clusters;
        }
    } else {
        @final_clusters = @clusters;
    }
    print "\n\nDisplaying final clustering results:\n\n" if $self->{_terminal_output};
    display_clusters(\@final_clusters) if $self->{_terminal_output};
    return \@final_clusters;
} 

sub display_reconstruction_errors_as_a_function_of_iterations {
    my $self = shift;            
    print "\n\nNumber of iterations used in Phase 1: $self->{_num_iterations_actually_used}\n";
    print "\nTotal reconstruction error as a function of iterations in Phase 1: " .
          "@{$self->{_reconstruction_error_as_a_function_of_iteration}}\n";
}

sub set_termination_reconstruction_error_threshold {
    my $self = shift;        
    my $all_ref_vecs = shift;
    my @mean_vecs = @$all_ref_vecs;
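    # Normalize the user-specified delta by the average magnitude of the cluster
    # means so that the termination test does not depend on the overall scale of
    # the data.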
    my $sum_of_mean_magnitudes = reduce {$a+$b} map { my $result = transpose($_) * $_; 
                                                      my @result = $result->as_list; 
                                                      sqrt($result[0])
                                                    } @mean_vecs;
    $self->{_scale_factor} = $sum_of_mean_magnitudes / @mean_vecs;
    $self->{_delta_normalized_error} = ($sum_of_mean_magnitudes / @mean_vecs ) * 
                                       $self->{_delta_reconstruction_error};
}

# This method is called only in the `unless' clause at the end of the main
# linear_manifold_clusterer() method.  It is called to find the cluster labels for
# those data elements that were left unclustered by the main part of the algorithm
# when graph partitioning is used to merge similar sub-clusters.  The operating logic
# here is that graph partitioning yields K main clusters even though each main cluster
# may not yet be fully populated.
sub which_cluster_for_new_element {
    my $self = shift;    

lib/Algorithm/LinearManifoldDataClusterer.pm

            }
            push @trailing_eigenvec_matrices_for_all_subspaces,$trailing_eigenvec_matrix;
            push @reference_vecs_for_all_subspaces, $mean;
        }
        $self->{_final_trailing_eigenvec_matrices_for_all_subspaces} = 
                                        \@trailing_eigenvec_matrices_for_all_subspaces;
        $self->{_final_reference_vecs_for_all_subspaces} = \@reference_vecs_for_all_subspaces;
    }
    my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
    $data_vec->set_col(0, $self->{_data_hash}->{$data_tag});
    my @errors = map {$self->reconstruction_error($data_vec,
                          $self->{_final_trailing_eigenvec_matrices_for_all_subspaces}->[$_],
                          $self->{_final_reference_vecs_for_all_subspaces}->[$_])}
                         0 .. $self->{_K}-1;
    my ($minval, $index_for_closest_subspace) = minimum(\@errors);
    return $index_for_closest_subspace;
}

sub graph_partition {
    my $self = shift;
    my $clusters = shift;
    my $how_many_clusters_looking_for = shift;
    print "\n\nGraph partitioning looking for $how_many_clusters_looking_for clusters\n\n";
    my $num_nodes = scalar @$clusters;
    my $W = Math::GSL::Matrix->new($num_nodes,$num_nodes);
    my $D = Math::GSL::Matrix->new($num_nodes,$num_nodes);
    $D->identity;
    my $neg_sqrt_of_D = Math::GSL::Matrix->new($num_nodes,$num_nodes);
    $neg_sqrt_of_D->identity;
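    # W will hold the similarity (adjacency) matrix of the cluster graph, and D
    # its diagonal degree matrix (filled in further below with the column sums of
    # W), as needed for the symmetric normalized Laplacian used by the Shi-Malik
    # normalized-cuts algorithm.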
    my @subspace_construction_errors;
    my @trailing_eigenvec_matrices_for_all_subspaces;
    my @reference_vecs_for_all_subspaces;
    foreach my $cluster (@$clusters) {
        my ($mean, $covariance) = $self->estimate_mean_and_covariance($cluster);
        my ($eigenvecs, $eigenvals) = $self->eigen_analysis_of_covariance($covariance);
        my @trailing_eigenvecs =  @{$eigenvecs}[$self->{_P} .. $self->{_data_dimensions}-1];
        my @trailing_eigenvals =  @{$eigenvals}[$self->{_P} .. $self->{_data_dimensions}-1];
        my $subspace_construction_error = reduce {abs($a) + abs($b)} @trailing_eigenvals;
        push @subspace_construction_errors, $subspace_construction_error;
        my $trailing_eigenvec_matrix = Math::GSL::Matrix->new($self->{_data_dimensions}, 
                                                             scalar(@trailing_eigenvecs));
        foreach my $j (0..@trailing_eigenvecs-1) {
            print "trailing eigenvec column: @{$trailing_eigenvecs[$j]}\n" if $self->{_debug};
            $trailing_eigenvec_matrix->set_col($j, $trailing_eigenvecs[$j]);
        }
        push @trailing_eigenvec_matrices_for_all_subspaces,$trailing_eigenvec_matrix;
        push @reference_vecs_for_all_subspaces, $mean;
    }
    # We consider the similarity matrix W to be the sum of two parts,
    # $W_reconstruction_error_based and $W_dist_between_means_based.  IMPORTANT:
    # For coding reasons, we first store the two similarity measures separately in
    # those two component matrices.  Our goal is to fill up these matrices with
    # the raw values while at the same time collecting the information needed to
    # normalize the two separate measures of similarity.
    my $W_reconstruction_error_based = Math::GSL::Matrix->new($num_nodes,$num_nodes);
    my $W_dist_between_means_based = Math::GSL::Matrix->new($num_nodes,$num_nodes);
    my @all_pairwise_reconstruction_errors;
    my @all_dist_between_means_errors;
    foreach my $i (0..$num_nodes-1) {
        foreach my $j (0..$num_nodes-1) {
            my ($recon_error_similarity, $dist_bet_means) = $self->pairwise_cluster_similarity(
                                                      $clusters->[$i], 
                                                      $trailing_eigenvec_matrices_for_all_subspaces[$i],
                                                      $reference_vecs_for_all_subspaces[$i],
                                                      $clusters->[$j],
                                                      $trailing_eigenvec_matrices_for_all_subspaces[$j],
                                                      $reference_vecs_for_all_subspaces[$j]);
            $W_reconstruction_error_based->set_elem($i, $j, $recon_error_similarity);
            $W_dist_between_means_based->set_elem($i, $j, $dist_bet_means);
            push @all_pairwise_reconstruction_errors, $recon_error_similarity;
            push @all_dist_between_means_errors, $dist_bet_means;
        }
    }
    my $recon_error_normalizer = (reduce {$a + $b} @all_pairwise_reconstruction_errors) / 
                                 (scalar @all_pairwise_reconstruction_errors);
    my $dist_bet_means_based_normalizer = (reduce {$a + $b} @all_dist_between_means_errors ) /
                                 (scalar @all_dist_between_means_errors );
    die "\n\nBailing out!\n" .
        "Dealing with highly defective clusters.  Try again.\n" 
        if ($recon_error_normalizer == 0) || ($dist_bet_means_based_normalizer == 0);
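    # Turn each raw error into a similarity in (0,1]: divide by the mean of all
    # raw values of its kind and apply a decaying exponential, so that smaller
    # errors map to larger edge weights.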
    foreach my $i (0..$num_nodes-1) {
        foreach my $j (0..$num_nodes-1) {
            my $recon_val = $W_reconstruction_error_based->get_elem($i,$j);
            my $new_recon_val = exp( -1.0 * $recon_val / $recon_error_normalizer );
            $W_reconstruction_error_based->set_elem($i,$j,$new_recon_val);
            my $mean_dist_val = $W_dist_between_means_based->get_elem($i,$j);
            my $new_mean_dist_val = exp( -1.0 * $mean_dist_val / $dist_bet_means_based_normalizer );
            $W_dist_between_means_based->set_elem($i,$j,$new_mean_dist_val);
        }
    }
    $W = $W_reconstruction_error_based + $W_dist_between_means_based;
    if ($self->{_debug}) {
        print "\nDisplaying the similarity matrix W for the cluster graph:\n";
        display_matrix($W) if $self->{_terminal_output};
    }
    my $add_all_columns = Math::GSL::Matrix->new($num_nodes,1);
    foreach my $col (0..$num_nodes-1) {
        $add_all_columns += $W->col($col);
    }
    foreach my $i (0..$num_nodes-1) {
        $D->set_elem($i,$i, $add_all_columns->get_elem($i,0));

lib/Algorithm/LinearManifoldDataClusterer.pm

}

sub pairwise_cluster_similarity {
    my $self = shift;
    my $cluster1 = shift;
    my $trailing_eigenvec_matrix_cluster1 = shift;   
    my $reference_vec_cluster1 = shift;
    my $cluster2 = shift;
    my $trailing_eigenvec_matrix_cluster2 = shift;   
    my $reference_vec_cluster2 = shift;
    my @errors_for_1_on_2 = map {my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
                                 $data_vec->set_col(0, $self->{_data_hash}->{$_});
                                 $self->reconstruction_error($data_vec,
                                                             $trailing_eigenvec_matrix_cluster2,
                                                             $reference_vec_cluster2)} 
                                 @$cluster1;
    my @errors_for_2_on_1 = map {my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
                                 $data_vec->set_col(0, $self->{_data_hash}->{$_});
                                 $self->reconstruction_error($data_vec,
                                                             $trailing_eigenvec_matrix_cluster1,
                                                             $reference_vec_cluster1)} 
                                 @$cluster2;
    my $type_1_error = reduce {abs($a) + abs($b)} @errors_for_1_on_2;
    my $type_2_error = reduce {abs($a) + abs($b)} @errors_for_2_on_1;
    my $total_reconstruction_error = $type_1_error + $type_2_error;
    my $diff_between_the_means = $reference_vec_cluster1 - $reference_vec_cluster2;
    my $dist_squared = transpose($diff_between_the_means) * $diff_between_the_means;
    my @dist_squared_as_list = $dist_squared->as_list();
    my $dist_between_means_based_error = shift @dist_squared_as_list;
    return ($total_reconstruction_error, $dist_between_means_based_error);
}

# delta ball
sub cluster_unimodality_quotient {
    my $self = shift;
    my $cluster = shift;
    my $mean = shift;   
    my $delta = 0.4 * $self->{_scale_factor};  # Radius of the delta ball along each dimension
    my @mean = $mean->as_list;
    my @data_tags_for_range_tests;

lib/Algorithm/LinearManifoldDataClusterer.pm

            push @coordinate_vec,  $min_bounds[$dimen] + rand($ranges[$dimen]);
        }
        push @random_points, \@coordinate_vec;
    }
    if ($self->{_debug}) {
        print "\nrandom points\n";
        map {print "@$_\n"} @random_points;
    }
    my @mean = $mean->as_list;
    unshift @random_points, \@mean;
    my @reconstruction_errors;
    foreach my $candidate_ref_vec (@random_points) {
        my $ref_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
        $ref_vec->set_col(0, $candidate_ref_vec);
        my $reconstruction_error_for_a_ref_vec = 0;
        foreach my $data_tag (@{$self->{_data_tags}}) {
            my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
            $data_vec->set_col(0, $self->{_data_hash}->{$data_tag});
            my $error = $self->reconstruction_error($data_vec,$trailing_eigenvec_matrix,$ref_vec);
            $reconstruction_error_for_a_ref_vec += $error;
        }
        push @reconstruction_errors, $reconstruction_error_for_a_ref_vec;
    }
    my $recon_error_for_original_mean = shift @reconstruction_errors;
    my $smallest_error_randomly_selected_ref_vecs = List::Util::min(@reconstruction_errors);
    # find the index of the smallest error; element 0 of @random_points holds the
    # original mean, so the index into @random_points is offset by one:
    my ($minindex) = grep { $reconstruction_errors[$_] == $smallest_error_randomly_selected_ref_vecs }
                     0 .. $#reconstruction_errors;
    my $refvec = $random_points[$minindex + 1];
    return $refvec;
}

##  The reconstruction error relates to the size of the perpendicular from a data
##  point X to the hyperplane that defines a given subspace on the manifold.
sub reconstruction_error {
    my $self = shift;
    my $data_vec = shift;
    my $trailing_eigenvecs = shift;
    my $ref_vec = shift;    
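    # error = (x - m)' * E * E' * (x - m), where the columns of E are the
    # trailing eigenvectors; this equals the squared magnitude of the projection
    # of (x - m) onto the subspace spanned by the discarded eigenvectors.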
    my $error_squared = transpose($data_vec - $ref_vec) * $trailing_eigenvecs *  
                                 transpose($trailing_eigenvecs) * ($data_vec - $ref_vec);
    my @error_squared_as_list = $error_squared->as_list();
    my $error_squared_as_scalar = shift @error_squared_as_list;
    return $error_squared_as_scalar;
}

# Returns a set of KM random integers.  These serve as indices to reach into the data
# array.  A data element whose index is one of the random numbers returned by this
# routine serves as an initial cluster center.  Note the quality check it runs on the
# list of the random integers constructed.  We first make sure that all the random
# integers returned are different.  Subsequently, we carry out a quality assessment
# of the random integers constructed.  This quality measure consists of the ratio of
# the values spanned by the random integers to the value of N, the total number of
# data points to be clustered.  Currently, if this ratio is less than 0.3, we discard

lib/Algorithm/LinearManifoldDataClusterer.pm


sub check_for_illegal_params {
    my @params = @_;
    my @legal_params = qw / datafile
                            mask
                            K
                            P
                            terminal_output
                            cluster_search_multiplier
                            max_iterations
                            delta_reconstruction_error
                            visualize_each_iteration
                            show_hidden_in_3D_plots
                            make_png_for_each_iteration
                            debug
                          /;
    my $found_match_flag;
    foreach my $param (@params) {
        foreach my $legal (@legal_params) {
            $found_match_flag = 0;
            if ($param eq $legal) {

lib/Algorithm/LinearManifoldDataClusterer.pm


  #  Now you must construct an instance of the clusterer through a call such as:

  my $clusterer = Algorithm::LinearManifoldDataClusterer->new(
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     
                                    P        => 2,     
                                    max_iterations => 15,
                                    cluster_search_multiplier => 2,
                                    delta_reconstruction_error => 0.001,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                  );

  #  where the parameter K specifies the number of clusters you expect to find in
  #  your data and the parameter P is the dimensionality of the manifold on which the
  #  data resides.  The parameter cluster_search_multiplier is for increasing the
  #  odds that the random seeds chosen initially for clustering will populate all the
  #  clusters.  Set this parameter to a low number like 2 or 3. The parameter
  #  max_iterations places a hard limit on the number of iterations that the
  #  algorithm is allowed.  The actual number of iterations is controlled by the
  #  parameter delta_reconstruction_error.  The iterations stop when the change in
  #  the total "reconstruction error" from one iteration to the next is smaller than
  #  the value specified by delta_reconstruction_error.
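
  #  For example, with K set to 3 and cluster_search_multiplier set to 2, Phase 1
  #  of the algorithm will look for 6 clusters, which Phase 2 then merges back
  #  into the 3 final clusters.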

  #  Next, you must get the module to read the data for clustering:

  $clusterer->get_data_from_csv();

  #  Finally, you invoke linear manifold clustering by:

  my $clusters = $clusterer->linear_manifold_clusterer();

  #  The value returned by this call is a reference to an array of anonymous arrays,
  #  with each anonymous array holding one cluster.  If you wish, you can have the
  #  module write the clusters to individual files by the following call:

  $clusterer->write_clusters_to_files($clusters);

  #  If you want to see how the reconstruction error changes with the iterations, you
  #  can make the call:

  $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

  #  When your data is 3-dimensional and when the clusters reside on a surface that
  #  is more or less spherical, you can visualize the clusters by calling

  $clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

  #  where the first argument is a label to be displayed in the 3D plot and the
  #  second argument the value returned by calling linear_manifold_clusterer().

  #  SYNTHETIC DATA GENERATION:

lib/Algorithm/LinearManifoldDataClusterer.pm

                             total_number_of_samples_needed => 1000,
                             number_of_clusters_on_sphere => 4,
                             show_hidden_in_3D_plots => 0,
                          );
  $training_data_gen->gen_data_and_write_to_csv();
  $training_data_gen->visualize_data_on_sphere($output_file);


=head1 CHANGES

Version 1.01: Typos and other errors removed from the documentation. Also included
in the documentation is a link to a tutorial on data processing on manifolds.


=head1 DESCRIPTION

If you are new to machine learning and data clustering on linear and nonlinear
manifolds, your first question is likely to be: What is a manifold?  A manifold is a
space that is locally Euclidean.  And a space is locally Euclidean if the points in
a small neighborhood can be represented by, say, Cartesian coordinates and if the
distances between the points in the neighborhood are given by

lib/Algorithm/LinearManifoldDataClusterer.pm

approximation to data that falls on a curve as opposed to constructing a single
straight line approximation to all of the data.  So whereas the frequently used PCA
algorithm gives you a single hyperplane approximation to all your data, what this
module returns is a set of hyperplane approximations, with each hyperplane derived by
applying the PCA algorithm locally to a data cluster.

That brings us to the problem of how to actually discover the best set of hyperplane
approximations to the data.  What is probably the most popular algorithm today for
that purpose is based on the following key idea: Given a set of subspaces to which a
data element can be assigned, you assign it to that subspace for which the
B<reconstruction error> is the least.  But what do we mean by a B<subspace> and what
is B<reconstruction error>?

To understand the notions of B<subspace> and B<reconstruction-error>, let's revisit
the traditional approach of dimensionality reduction by the PCA algorithm.  The PCA
algorithm consists of: (1) Subtracting from each data element the global mean of the
data; (2) Calculating the covariance matrix of the data; (3) Carrying out an
eigendecomposition of the covariance matrix and ordering the eigenvectors according
to decreasing values of the corresponding eigenvalues; (4) Forming a B<subspace> by
discarding the trailing eigenvectors whose corresponding eigenvalues are relatively
small; and, finally, (5) projecting all the data elements into the subspace so
formed. The error incurred in representing a data element by its projection into the
subspace is known as the B<reconstruction error>.  This error is the projection of
the data element into the space spanned by the discarded trailing eigenvectors.
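
Concretely (a minimal numeric sketch using plain Perl arrays rather than the
C<Math::GSL::Matrix> objects the module itself uses, with made-up values): if the
trailing eigenvectors are stacked as the columns of a matrix C<E> and C<m> is the
cluster mean, the reconstruction error for a data vector C<x> is the squared norm
of the projection of C<x - m> onto the trailing eigenvectors, that is,
C<(x-m)' * E * E' * (x-m)>:

    # one trailing eigenvector in 3D (so P = 2); all values are made up
    my @x    = (1.0, 2.0, 3.0);                 # data vector
    my @mean = (0.9, 2.1, 2.8);                 # cluster mean
    my @e    = (0.0, 0.0, 1.0);                 # trailing (unit) eigenvector
    my @diff = map { $x[$_] - $mean[$_] } 0 .. $#x;
    my $proj = 0;
    $proj += $diff[$_] * $e[$_] for 0 .. $#diff;
    my $reconstruction_error = $proj ** 2;      # 0.04 for these numbers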

I<In linear-manifold based machine learning, instead of constructing a single
subspace in the manner described above, we construct a set of subspaces, one for each
data cluster on the nonlinear manifold.  After the subspaces have been constructed, a
data element is assigned to that subspace for which the reconstruction error is the
least.> On the face of it, this sounds like a chicken-and-egg sort of a problem.  You
need to have already clustered the data in order to construct the subspaces at
different places on the manifold so that you can figure out which cluster to place a
data element in.

Such problems, when they do possess a solution, are best tackled through iterative
algorithms in which you start with a guess for the final solution, you rearrange the
measured data on the basis of the guess, and you then use the new arrangement of the
data to refine the guess.  Subsequently, you iterate through the second and the third
steps until you do not see any discernible changes in the new arrangements of the
data.  This forms the basis of the clustering algorithm that is described under
B<Phase 1> in the section that follows.  This algorithm was first proposed in the
article "Dimension Reduction by Local Principal Component Analysis" by Kambhatla and
Leen that appeared in the journal Neural Computation in 1997.

Unfortunately, experiments show that the algorithm as proposed by Kambhatla and Leen
is much too sensitive to how the clusters are seeded initially.  To get around this
limitation of the basic clustering-by-minimization-of-reconstruction-error, this
module implements a two-phase approach.  In B<Phase 1>, we introduce a multiplier
effect in our search for clusters by looking for C<M*K> clusters instead of the main
C<K> clusters.  In this manner, we increase the odds that each original cluster will
be visited by one or more of the C<M*K> randomly selected seeds at the beginning,
where C<M> is the integer value given to the constructor parameter
C<cluster_search_multiplier>.  Subsequently, we merge the clusters that belong
together in order to form the final C<K> clusters.  That work is done in B<Phase 2>
of the algorithm.

For the cluster merging operation in Phase 2, we model the C<M*K> clusters as the
nodes of an attributed graph in which the weight given to an edge connecting a pair
of nodes is a measure of the similarity between the two clusters corresponding to the
two nodes.  Subsequently, we use spectral clustering to merge the most similar nodes
in our quest to partition the data into C<K> clusters.  For that purpose, we use the
Shi-Malik normalized cuts algorithm.  The pairwise node similarity required by this
algorithm is measured by the C<pairwise_cluster_similarity()> method of the
C<LinearManifoldDataClusterer> class.  The smaller the overall reconstruction error
when all of the data elements in one cluster are projected into the other's subspace
and vice versa, the greater the similarity between two clusters.  Additionally, the
smaller the distance between the mean vectors of the clusters, the greater the
similarity between two clusters.  The overall similarity between a pair of clusters
is a combination of these two similarity measures.
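
To restate that weighting compactly (a condensed paraphrase of the normalization
performed in C<graph_partition()> shown earlier, not additional API): each raw
value is divided by the average of all raw values of its kind and passed through a
decaying exponential, so that smaller errors and smaller inter-mean distances yield
edge weights closer to 1:

    my $w_recon = exp( -1.0 * $recon_error / $recon_error_normalizer );
    my $w_means = exp( -1.0 * $dist_between_means / $dist_bet_means_based_normalizer );
    my $edge_weight = $w_recon + $w_means;    # the (i,j)-th element of W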

For additional information regarding the theoretical underpinnings of the algorithm
implemented in this module, visit
L<https://engineering.purdue.edu/kak/Tutorials/ClusteringDataOnManifolds.pdf>

lib/Algorithm/LinearManifoldDataClusterer.pm


We now present a summary of the two phases of the algorithm implemented in this
module.  Note particularly the important role played by the constructor parameter
C<cluster_search_multiplier>.  It is only when the integer value given to this
parameter is greater than 1 that Phase 2 of the algorithm kicks in.

=over 4

=item B<Phase 1:>

Through iterative minimization of the total reconstruction error, this phase of the
algorithm returns C<M*K> clusters where C<K> is the actual number of clusters you
expect to find in your data and where C<M> is the integer value given to the
constructor parameter C<cluster_search_multiplier>.  As previously mentioned, on
account of the sensitivity of the reconstruction-error based clustering to how the
clusters are initially seeded, our goal is to look for C<M*K> clusters with the idea
of increasing the odds that each of the C<K> clusters will see at least one seed at
the beginning of the algorithm.

=over 4

=item Step 1:

Randomly choose C<M*K> data elements to serve as the seeds for that many clusters.

lib/Algorithm/LinearManifoldDataClusterer.pm

Construct initial C<M*K> clusters by assigning each data element to that cluster
whose seed it is closest to.

=item Step 3:

Calculate the mean and the covariance matrix for each of the C<M*K> clusters and
carry out an eigendecomposition of the covariance matrix.  Order the eigenvectors in
decreasing order of the corresponding eigenvalues.  The first C<P> eigenvectors
define the subspace for that cluster.  Use the space spanned by the remaining
eigenvectors --- we refer to them as the trailing eigenvectors --- for calculating
the reconstruction error.

=item Step 4:

Taking into account the mean associated with each cluster, re-cluster the entire data
set on the basis of the least reconstruction error.  That is, assign each data
element to that subspace for which it has the smallest reconstruction error.
Calculate the total reconstruction error associated with all the data elements.  (See
the definition of the reconstruction error in the C<Description> section.)

=item Step 5:

Stop iterating if the change in the total reconstruction error from the previous 
iteration to the current iteration is less than the value specified by the constructor
parameter C<delta_reconstruction_error>.  Otherwise, go back to Step 3.

=back

=item B<Phase 2:>

This phase of the algorithm uses graph partitioning to merge the C<M*K> clusters back
into the C<K> clusters you expect to see in your data.  Since the algorithm whose
steps are presented below is invoked recursively, let's assume that we have C<N>
clusters that need to be merged by invoking the Shi-Malik spectral clustering
algorithm as described below:

lib/Algorithm/LinearManifoldDataClusterer.pm

=item Step 1: 

Form a graph whose C<N> nodes represent the C<N> clusters.  (For the very first
invocation of this step, we have C<N = M*K>.)

=item Step 2:

Construct an C<NxN> similarity matrix for the nodes in the graph. The C<(i,j)>-th
element of this matrix is the pairwise similarity between the clusters indexed C<i>
and C<j>. Calculate this similarity on the basis of the following two criteria: (1)
The total reconstruction error when the data elements in the cluster indexed C<j> are
projected into the subspace for the cluster indexed C<i> and vice versa. And (2) The
distance between the mean vectors associated with the two clusters.

=item Step 3:

Calculate the symmetric normalized Laplacian of the similarity matrix.  We use C<A>
to denote the symmetric normalized Laplacian.

=item Step 4:

lib/Algorithm/LinearManifoldDataClusterer.pm

partition, add the `one' cluster to the output list of clusters and invoke graph
partitioning recursively on the `rest' by going back to Step 1.  On the other hand,
if the cardinality of both the partitions returned by Step 4 exceeds 1, invoke graph
partitioning recursively on both partitions.  Stop when the list of clusters in the
output list equals C<K>.

=item Step 6:

In general, the C<K> clusters obtained by recursive graph partitioning will not cover
all of the data.  So, for the final step, assign each data element not covered by the
C<K> clusters to that cluster for which its reconstruction error is the least.

=back

=back
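
The leftover assignment in Step 6 above is carried out by the
C<which_cluster_for_new_element()> method shown earlier: for each data element left
unclustered, pick the subspace with the smallest reconstruction error.
Schematically (a condensed restatement of the module's own code, not new API):

    my @errors = map {$self->reconstruction_error($data_vec,
                          $self->{_final_trailing_eigenvec_matrices_for_all_subspaces}->[$_],
                          $self->{_final_reference_vecs_for_all_subspaces}->[$_])}
                 0 .. $self->{_K} - 1;
    my ($minval, $index_for_closest_subspace) = minimum(\@errors);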

=head1 FAIL-FIRST BIAS OF THE MODULE

As you would expect for all such iterative algorithms, the module carries no
theoretical guarantee that it will give you correct results. But what does that mean?
Suppose you create synthetic data that consists of Gaussian-looking disjoint clusters

lib/Algorithm/LinearManifoldDataClusterer.pm


=item B<new():>

    my $clusterer = Algorithm::LinearManifoldDataClusterer->new(
                                        datafile                    => $datafile,
                                        mask                        => $mask,
                                        K                           => $K,
                                        P                           => $P,     
                                        cluster_search_multiplier   => $C,
                                        max_iterations              => $max_iter,
                                        delta_reconstruction_error  => 0.001,
                                        terminal_output             => 1,
                                        visualize_each_iteration    => 1,
                                        show_hidden_in_3D_plots     => 1,
                                        make_png_for_each_iteration => 1,
                    );

A call to C<new()> constructs a new instance of the
C<Algorithm::LinearManifoldDataClusterer> class.

lib/Algorithm/LinearManifoldDataClusterer.pm


=item C<P>:

This parameter specifies the dimensionality of the manifold on which the data resides.

=item C<cluster_search_multiplier>:

As should be clear from the C<Summary of the Algorithm> section, this parameter plays
a very important role in the successful clustering of your data.  As explained in
C<Description>, the basic algorithm used for clustering in Phase 1 --- clustering by
the minimization of the reconstruction error --- is sensitive to the choice of the
cluster seeds that are selected randomly at the beginning of the algorithm.  Should
it happen that the seeds miss one or more of the clusters, the clustering produced is
likely to not be correct.  By giving an integer value to C<cluster_search_multiplier>
that is greater than 1, you'll increase the odds that the randomly selected seeds
will see all clusters.  When you set C<cluster_search_multiplier> to C<M>, you ask
Phase 1 of the algorithm to construct C<M*K> clusters as opposed to just C<K>
clusters. Subsequently, in Phase 2, the module uses inter-cluster similarity based
graph partitioning to merge the C<M*K> clusters into C<K> clusters.

=item C<max_iterations>:

This parameter places a hard limit on the number of iterations in Phase 1 of the
algorithm.  Ordinarily,
the iterations stop automatically when the change in the total reconstruction error
from one iteration to the next is less than the value specified by the parameter
C<delta_reconstruction_error>.

=item C<delta_reconstruction_error>:

It is this parameter that determines when the iterations will actually terminate in
Phase 1 of the algorithm.  When the difference in the total reconstruction error from
one iteration to the next is less than the value given to this parameter, the
iterations come to an end. B<IMPORTANT: I have noticed that the larger the number of
data samples that need to be clustered, the larger must be the value given to this
parameter.  That makes intuitive sense since the total reconstruction error is the
sum of all such errors for all of the data elements.> Unfortunately, the best value
for this parameter does NOT appear to depend linearly on the total number of data
records to be clustered.

=item C<terminal_output>:

When this parameter is set, you will see in your terminal window the different
clusters as lists of the symbolic tags associated with the data records.  You will
also see in your terminal window the output produced by the different steps of the
graph partitioning algorithm as smaller clusters are merged to form the final C<K>
clusters --- assuming that you set the parameter C<cluster_search_multiplier> to an

lib/Algorithm/LinearManifoldDataClusterer.pm

    or 

    my $clusters = $clusterer->linear_manifold_clusterer();

This is the main call to the linear-manifold based clusterer.  The first call works
by side effect, meaning that you will see the clusters in your terminal window and
they will be written out to disk files (depending on the constructor options you
have set).  The second call, in addition, returns the clusters as a reference to an
array of anonymous arrays, each holding the symbolic tags for a cluster.
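
The returned structure can be walked like any Perl array of arrays.  A minimal
sketch:

    foreach my $i (0 .. $#$clusters) {
        my @tags = @{ $clusters->[$i] };
        print "cluster $i contains ", scalar(@tags), " records: @tags\n";
    }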

=item B<display_reconstruction_errors_as_a_function_of_iterations()>:

    $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

This method would normally be called after the clustering is completed to see how the
reconstruction errors decreased with the iterations in Phase 1 of the overall
algorithm.

=item B<write_clusters_to_files()>:

    $clusterer->write_clusters_to_files($clusters);

As its name implies, when you call this method, the final clusters are written
out to disk files.  The files have names like:

     cluster0.txt 


