Algorithm-LinearManifoldDataClusterer

 view release on metacpan or  search on metacpan

MANIFEST  view on Meta::CPAN

examples/example1.pl
examples/example2.pl
examples/example3.pl
examples/example4.pl
examples/generate_data_on_a_sphere.pl
examples/3_clusters_on_a_sphere_3000_samples.csv
examples/3_clusters_on_a_sphere_498_samples.csv
examples/4_clusters_on_a_sphere_1000_samples.csv
examples/data_visualizer.pl
examples/example1_initial_clustering.png
examples/example1_clustering_after_3_iterations.png
examples/example1_clustering_after_graph_partitioning.png
examples/example1_final_clustering.png
examples/example2_initial_clustering.png
examples/example2_clustering_after_3_iterations.png
examples/example2_clustering_after_8_iterations.png
examples/example2_clustering_after_graph_partitioning.png
examples/example2_final_clustering.png
examples/example3_initial_clustering.png
examples/example3_clustering_after_3_iterations.png
examples/example3_clustering_after_7_iterations.png
examples/example3_clustering_after_graph_partitioning.png
examples/example3_final_clustering.png
examples/README
lib/Algorithm/LinearManifoldDataClusterer.pm
Makefile.PL
MANIFEST			This list of files
README
t/test.t
META.yml                                 Module YAML meta-data (added by MakeMaker)
META.json                                Module JSON meta-data (added by MakeMaker)

examples/example1.pl  view on Meta::CPAN


my $datafile = "3_clusters_on_a_sphere_498_samples.csv";

my $mask = "N111"; 

my $clusterer = Algorithm::LinearManifoldDataClusterer->new( 
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     # number of clusters
                                    P        => 2,     # manifold dimensionality
                                    max_iterations => 15,
                                    cluster_search_multiplier => 2,
                                    delta_reconstruction_error => 0.001,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                );

$clusterer->get_data_from_csv();

my $clusters = $clusterer->linear_manifold_clusterer();

$clusterer->display_reconstruction_errors_as_a_function_of_iterations();

$clusterer->write_clusters_to_files($clusters);

$clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

# Now make a png image file that shows the final clusters:
$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

examples/example2.pl  view on Meta::CPAN


my $datafile = "3_clusters_on_a_sphere_3000_samples.csv";

my $mask = "N111";         # for sphericaldata.csv

my $clusterer = Algorithm::LinearManifoldDataClusterer->new( 
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     # number of clusters
                                    P        => 2,     # manifold dimensionality
                                    max_iterations => 15,
                                    cluster_search_multiplier => 2,
                                    delta_reconstruction_error => 0.012,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 0,
                                    make_png_for_each_iteration => 1,
                );

$clusterer->get_data_from_csv();

my $clusters = $clusterer->linear_manifold_clusterer();

$clusterer->display_reconstruction_errors_as_a_function_of_iterations();

$clusterer->write_clusters_to_files($clusters);

$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters);

# Now make a png image file that shows the final clusters:
$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

examples/example3.pl  view on Meta::CPAN

my $datafile = "4_clusters_on_a_sphere_1000_samples.csv";

my $mask = "N111";         # for sphericaldata.csv

my $clusterer = Algorithm::LinearManifoldDataClusterer->new( 
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 4,     # number of clusters
                                    P        => 2,     # manifold dimensionality
                                    cluster_search_multiplier => 2,
                                    max_iterations => 15,
                                    delta_reconstruction_error => 0.002,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                );

$clusterer->get_data_from_csv();

my $clusters = $clusterer->linear_manifold_clusterer();

$clusterer->display_reconstruction_errors_as_a_function_of_iterations();

$clusterer->write_clusters_to_files($clusters);

$clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

# Now make a png image file that shows the final clusters:
$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

examples/example4.pl  view on Meta::CPAN


my $datafile = "3_clusters_on_a_sphere_498_samples.csv";

my $mask = "N111"; 

my $clusterer = Algorithm::LinearManifoldDataClusterer->new( 
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     # number of clusters
                                    P        => 2,     # manifold dimensionality
                                    max_iterations => 15,
                                    cluster_search_multiplier => 1,
                                    delta_reconstruction_error => 0.001,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                );

$clusterer->get_data_from_csv();

my $clusters = $clusterer->auto_retry_clusterer();

$clusterer->display_reconstruction_errors_as_a_function_of_iterations();

$clusterer->write_clusters_to_files($clusters);

$clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

# Now make a png image file that shows the final clusters:
$clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

    my @params = keys %args;
    croak "\nYou have used a wrong name for a keyword argument " .
          "--- perhaps a misspelling\n" 
          if check_for_illegal_params(@params) == 0;
    bless {
        _datafile                     =>   $args{datafile} || croak("datafile required"),
        _mask                         =>   $args{mask}     || croak("mask required"),
        _K                            =>   $args{K}        || 0,
        _P                            =>   $args{P}        || 0,
        _terminal_output              =>   $args{terminal_output} || 0,
        _max_iterations               =>   $args{max_iterations} || 0,
        _delta_reconstruction_error   =>   $args{delta_reconstruction_error} || 0.001,
        _delta_normalized_error       =>   undef,
        _cluster_search_multiplier    =>   $args{cluster_search_multiplier} || 1,
        _visualize_each_iteration     =>   $args{visualize_each_iteration} == 0 ? 0 : 1,
        _show_hidden_in_3D_plots      =>   $args{show_hidden_in_3D_plots} == 0 ? 0 : 1,
        _make_png_for_each_iteration  =>   $args{make_png_for_each_iteration} == 0 ? 0 : 1,
        _debug                        =>   $args{debug} || 0,
        _N                            =>   0,
        _KM                           =>   $args{K} * $args{cluster_search_multiplier},
        _data_hash                    =>   {},
        _data_tags                    =>   [],
        _data_dimensions              =>   0,
        _final_clusters               =>   [],
        _auto_retry_flag              =>   0,
        _num_iterations_actually_used =>   undef,
        _scale_factor                 =>   undef,
        _data_tags_to_cluster_label_hash  => {},
        _final_reference_vecs_for_all_subspaces => [],
        _reconstruction_error_as_a_function_of_iteration => [],
        _final_trailing_eigenvec_matrices_for_all_subspaces => [],
        _subspace_construction_error_as_a_function_of_iteration => [],
    }, $class;
}

sub get_data_from_csv {

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

    }
    foreach my $cluster (@$initial_clusters) {
        my ($mean, $covariance) = $self->estimate_mean_and_covariance($cluster);
        display_mean_and_covariance($mean, $covariance) if $self->{_debug};
    }
    my @clusters = @$initial_clusters;
    display_clusters(\@clusters) if $self->{_debug};
    my $iteration_index = 0;
    my $unimodal_correction_flag;
    my $previous_min_value_for_unimodality_quotient;
    while ($iteration_index < $self->{_max_iterations}) {
        print "\n\n========================== STARTING ITERATION $iteration_index =====================\n\n"
            if $self->{_terminal_output};
        my $total_reconstruction_error_this_iteration = 0;
        my @subspace_construction_errors_this_iteration;
        my @trailing_eigenvec_matrices_for_all_subspaces;
        my @reference_vecs_for_all_subspaces;
        foreach my $cluster (@clusters) {
            next if @$cluster == 0;
            my ($mean, $covariance) = $self->estimate_mean_and_covariance($cluster);
            display_mean_and_covariance($mean, $covariance) if $self->{_debug};

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

            $self->visualize_clusters_on_sphere($visualization_msg, \@clusters)
                if $self->{_visualize_each_iteration};
            $self->visualize_clusters_on_sphere($visualization_msg, \@clusters, "png")
                if $self->{_make_png_for_each_iteration};
        }
        my @cluster_unimodality_quotients = map {$self->cluster_unimodality_quotient($clusters[$_], 
                                                      $reference_vecs_for_all_subspaces[$_])} 0..@clusters-1;
        my $min_value_for_unimodality_quotient = List::Util::min @cluster_unimodality_quotients;
        print "\nCluster unimodality quotients: @cluster_unimodality_quotients\n" if $self->{_terminal_output};
        die "\n\nBailing out!\n" .
            "It does not look like these iterations will lead to a good clustering result.\n" .
            "Program terminating.  Try running again.\n" 
            if defined($previous_min_value_for_unimodality_quotient)
               && ($min_value_for_unimodality_quotient < 0.4)
               && ($min_value_for_unimodality_quotient < (0.5 * $previous_min_value_for_unimodality_quotient));
        if ( $min_value_for_unimodality_quotient < 0.5 ) {
            $unimodal_correction_flag = 1;
            print "\nApplying unimodality correction:\n\n" if $self->{_terminal_output};
            my @sorted_cluster_indexes = 
               sort {$cluster_unimodality_quotients[$b] <=> $cluster_unimodality_quotients[$a]} 0..@clusters-1;
            my @newclusters;

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

                push @{$self->{_reconstruction_error_as_a_function_of_iteration}}, 
                                                      $total_reconstruction_error_this_iteration;
                last;
            }
        }
        push @{$self->{_reconstruction_error_as_a_function_of_iteration}}, 
                                              $total_reconstruction_error_this_iteration;
        $iteration_index++;
        $previous_min_value_for_unimodality_quotient =  $min_value_for_unimodality_quotient;
    } # end of while loop on iteration_index
    $self->{_num_iterations_actually_used} = 
                                 scalar @{$self->{_reconstruction_error_as_a_function_of_iteration}};
    if ($self->{_terminal_output}) {
        print "\nIterations of the main loop terminated at iteration number $iteration_index.\n";
        print "Will now invoke graph partitioning to discover dominant clusters and to\n" .
              "merge small clusters.\n\n" if $self->{_cluster_search_multiplier} > 1;
        print "Total reconstruction error as a function of iterations: " .
                                       "@{$self->{_reconstruction_error_as_a_function_of_iteration}}";
    }
    # now merge sub-clusters if cluster_search_multiplier > 1
    my @final_clusters;
    if ($self->{_cluster_search_multiplier} > 1) {
        print "\n\nInvoking recursive graph partitioning to merge small clusters\n\n";
        my @array_of_partitioned_cluster_groups = (\@clusters);
        my @partitioned_cluster_groups;
        my $how_many_clusters_looking_for = $self->{_K};
        while (scalar(@final_clusters) < $self->{_K}) {

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

            @final_clusters = @new_final_clusters;
        }
    } else {
        @final_clusters = @clusters;
    }
    print "\n\nDisplaying final clustering results:\n\n" if $self->{_terminal_output};
    display_clusters(\@final_clusters) if $self->{_terminal_output};
    return \@final_clusters;
} 

sub display_reconstruction_errors_as_a_function_of_iterations {
    my $self = shift;            
    print "\n\nNumber of iterations used in Phase 1: $self->{_num_iterations_actually_used}\n";
    print "\nTotal reconstruction error as a function of iterations in Phase 1: " .
          "@{$self->{_reconstruction_error_as_a_function_of_iteration}}\n";
}

sub set_termination_reconstruction_error_threshold {
    my $self = shift;        
    my $all_ref_vecs = shift;
    my @mean_vecs = @$all_ref_vecs;
    my $sum_of_mean_magnitudes = reduce {$a+$b} map { my $result = transpose($_) * $_; 
                                                      my @result = $result->as_list; 
                                                      sqrt($result[0])

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

    my @ranges;
    foreach my $dimen (0..$self->{_data_dimensions}-1) {
        my @values = map {$_->[$dimen]} map {$self->{_data_hash}->{$_}} @$cluster;
        my ($min, $max) = (List::Util::min(@values), List::Util::max(@values));
        push @min_bounds, $min;
        push @max_bounds, $max;
        push @ranges, $max - $min;
    }
    print "min bounds are: @min_bounds\n";
    print "max bounds are: @max_bounds\n";
    my $max_iterations = 100;
    my @random_points;
    my $iteration = 0;
    while ($iteration++ < $max_iterations) {
        my @coordinate_vec;
        foreach my $dimen (0..$self->{_data_dimensions}-1) {        
            push @coordinate_vec,  $min_bounds[$dimen] + rand($ranges[$dimen]);
        }
        push @random_points, \@coordinate_vec;
    }
    if ($self->{_debug}) {
        print "\nrandom points\n";
        map {print "@$_\n"} @random_points;
    }

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

}

sub check_for_illegal_params {
    my @params = @_;
    my @legal_params = qw / datafile
                            mask
                            K
                            P
                            terminal_output
                            cluster_search_multiplier
                            max_iterations
                            delta_reconstruction_error
                            visualize_each_iteration
                            show_hidden_in_3D_plots
                            make_png_for_each_iteration
                            debug
                          /;
    my $found_match_flag;
    foreach my $param (@params) {
        foreach my $legal (@legal_params) {
            $found_match_flag = 0;

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

  #  clustered, the mask would become `N0111' assuming that that the symbolic tag is
  #  still in the first column.

  #  Now you must construct an instance of the clusterer through a call such as:

  my $clusterer = Algorithm::LinearManifoldDataClusterer->new(
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     
                                    P        => 2,     
                                    max_iterations => 15,
                                    cluster_search_multiplier => 2,
                                    delta_reconstruction_error => 0.001,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                  );

  #  where the parameter K specifies the number of clusters you expect to find in
  #  your data and the parameter P is the dimensionality of the manifold on which the
  #  data resides.  The parameter cluster_search_multiplier is for increasing the
  #  odds that the random seeds chosen initially for clustering will populate all the
  #  clusters.  Set this parameter to a low number like 2 or 3. The parameter
  #  max_iterations places a hard limit on the number of iterations that the
  #  algorithm is allowed.  The actual number of iterations is controlled by the
  #  parameter delta_reconstruction_error.  The iterations stop when the change in
  #  the total "reconstruction error" from one iteration to the next is smaller than
  #  the value specified by delta_reconstruction_error.

  #  Next, you must get the module to read the data for clustering:

  $clusterer->get_data_from_csv();

  #  Finally, you invoke linear manifold clustering by:

  my $clusters = $clusterer->linear_manifold_clusterer();

  #  The value returned by this call is a reference to an array of anonymous arrays,
  #  with each anonymous array holding one cluster.  If you wish, you can have the
  #  module write the clusters to individual files by the following call:

  $clusterer->write_clusters_to_files($clusters);

  #  If you want to see how the reconstruction error changes with the iterations, you
  #  can make the call:

  $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

  #  When your data is 3-dimensional and when the clusters reside on a surface that
  #  is more or less spherical, you can visualize the clusters by calling

  $clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

  #  where the first argument is a label to be displayed in the 3D plot and the
  #  second argument the value returned by calling linear_manifold_clusterer().

  #  SYNTHETIC DATA GENERATION:

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

=over 4

=item B<new():>

    my $clusterer = Algorithm::LinearManifoldDataClusterer->new(
                                        datafile                    => $datafile,
                                        mask                        => $mask,
                                        K                           => $K,
                                        P                           => $P,     
                                        cluster_search_multiplier   => $C,
                                        max_iterations              => $max_iter,
                                        delta_reconstruction_error  => 0.001,
                                        terminal_output             => 1,
                                        write_clusters_to_files     => 1,
                                        visualize_each_iteration    => 1,
                                        show_hidden_in_3D_plots     => 1,
                                        make_png_for_each_iteration => 1,
                    );

A call to C<new()> constructs a new instance of the
C<Algorithm::LinearManifoldDataClusterer> class.

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

the minimization of the reconstruction error --- is sensitive to the choice of the
cluster seeds that are selected randomly at the beginning of the algorithm.  Should
it happen that the seeds miss one or more of the clusters, the clustering produced is
likely to not be correct.  By giving an integer value to C<cluster_search_multiplier>
that is greater than 1, you'll increase the odds that the randomly selected seeds
will see all clusters.  When you set C<cluster_search_multiplier> to C<M>, you ask
Phase 1 of the algorithm to construct C<M*K> clusters as opposed to just C<K>
clusters. Subsequently, in Phase 2, the module uses inter-cluster similarity based
graph partitioning to merge the C<M*K> clusters into C<K> clusters.

=item C<max_iterations>:

This hard limits the number of iterations in Phase 1 of the algorithm.  Ordinarily,
the iterations stop automatically when the change in the total reconstruction error
from one iteration to the next is less than the value specified by the parameter
C<delta_reconstruction_error>.

=item C<delta_reconstruction_error>:

It is this parameter that determines when the iterations will actually terminate in
Phase 1 of the algorithm.  When the difference in the total reconstruction error from
one iteration to the next is less than the value given to this parameter, the
iterations come to an end. B<IMPORTANT: I have noticed that the larger the number of
data samples that need to be clustered, the larger must be the value give to this
parameter.  That makes intuitive sense since the total reconstruction error is the
sum of all such errors for all of the data elements.> Unfortunately, the best value
for this parameter does NOT appear to depend linearly on the total number of data
records to be clustered.

=item C<terminal_output>:

When this parameter is set, you will see in your terminal window the different
clusters as lists of the symbolic tags associated with the data records.  You will

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

    or 

    my $clusters = $clusterer->linear_manifold_clusterer();

This is the main call to the linear-manifold based clusterer.  The first call works
by side-effect, meaning that you will see the clusters in your terminal window and
they would be written out to disk files (depending on the constructor options you
have set).  The second call also returns the clusters as a reference to an array of
anonymous arrays, each holding the symbolic tags for a cluster.

=item B<display_reconstruction_errors_as_a_function_of_iterations()>:

    $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

This method would normally be called after the clustering is completed to see how the
reconstruction errors decreased with the iterations in Phase 1 of the overall
algorithm.

=item B<write_clusters_to_files()>:

    $clusterer->write_clusters_to_files($clusters);

As its name implies, when you call this methods, the final clusters would be written
out to disk files.  The files have names like:

     cluster0.txt 

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

    ...
    ...

in the directory in which you execute the module.  The number of such files will
equal the number of clusters formed.  All such existing files in the directory are
destroyed before any fresh ones are created.  Each cluster file contains the symbolic
tags of the data samples in that cluster.

Assuming that the data dimensionality is 3, if you have set the constructor parameter
C<visualize_each_iteration>, the module will deposit in your directory printable PNG
files that are point plots of the different clusters in the different iterations of
the algorithm.  Such printable files are also generated for the initial clusters at
the beginning of the iterations and for the final clusters in Phase 1 of the
algorithm.  You will also see in your directory a PNG file for the clustering result
produced by graph partitioning in Phase 2.

Also, as mentioned previously, a call to C<linear_manifold_clusterer()> in your own
code will return the clusters to you directly.

=head1 REQUIRED

This module requires the following modules:

t/test.t  view on Meta::CPAN



# Test 2 (Linear-Manifold Based Clustering):

my $mask = "N111";
my $clusterer = Algorithm::LinearManifoldDataClusterer->new( 
                                     datafile => $datafile,
                                     mask     => "N111",
                                     K        => 3,
                                     P        => 2,
                                     max_iterations => 1,
                                     cluster_search_multiplier => 1,
                                     terminal_output => 0,
                                     visualize_each_iteration => 0,
                                     show_hidden_in_3D_plots => 0,
                                     make_png_for_each_iteration => 0,
                );
$clusterer->get_data_from_csv();
my $clusters = $clusterer->auto_retry_clusterer();
ok( @$clusters == 3,  'Clustering works' );



( run in 0.929 second using v1.01-cache-2.11-cpan-71847e10f99 )