examples/README view on Meta::CPAN
clusters, the clusters look very different in the two data files. The clusters are
much more spread out in 3_clusters_on_a_sphere_3000_samples.csv.
The code in example4.pl is special because it shows how you can call the
auto_retry_clusterer() method of the module for automatic repeated invocations of the
clustering program until success is achieved. As currently programmed, the value of
the constructor parameter cluster_search_multiplier is set to 1, which means that
Phase 1 of the algorithm will look for exactly K clusters and there is no need to
invoke Phase 2 of the algorithm for merging subclusters.
This directory also includes PNG image files that show some intermediate and the best
final results that can be achieved by the first three examples, that is, for the
scripts example1.pl, example2.pl, and example3.pl. If you are on a Linux machine and
if you have the ImageMagick library installed, you can use the `display' command to
see the results in the PNG images. The results you get with your own runs of the
three example scripts may or may not look like what you see in the outputs shown in
the PNG files depending on how the module seeds the clusters. Your best results
should look like what you see in the PNG images.
This directory also contains the following utility scripts:
For generating the data for experiments with clustering:
generate_data_on_a_sphere.pl
For visualizing the data generated by the above script:
data_visualizer.pl
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
my $trailing_eigenvec_matrix = Math::GSL::Matrix->new($self->{_data_dimensions},
scalar(@trailing_eigenvecs));
foreach my $j (0..@trailing_eigenvecs-1) {
print "trailing eigenvec column: @{$trailing_eigenvecs[$j]}\n" if $self->{_debug};
$trailing_eigenvec_matrix->set_col($j, $trailing_eigenvecs[$j]);
}
push @trailing_eigenvec_matrices_for_all_subspaces,$trailing_eigenvec_matrix;
push @reference_vecs_for_all_subspaces, $mean;
}
$self->set_termination_reconstruction_error_threshold(\@reference_vecs_for_all_subspaces);
my %best_subspace_based_partition_of_data;
foreach my $i (0..$self->{_KM}-1) {
$best_subspace_based_partition_of_data{$i} = [];
}
foreach my $data_tag (@{$self->{_data_tags}}) {
my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
$data_vec->set_col(0, $self->{_data_hash}->{$data_tag});
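# Measure the reconstruction error of this data point against every subspace and
# assign the point to the subspace that reconstructs it with the least error: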
my @errors = map {$self->reconstruction_error($data_vec,
$trailing_eigenvec_matrices_for_all_subspaces[$_],
$reference_vecs_for_all_subspaces[$_])}
0 .. $self->{_KM}-1;
my ($minval, $index_for_closest_subspace) = minimum(\@errors);
$total_reconstruction_error_this_iteration += $minval;
push @{$best_subspace_based_partition_of_data{$index_for_closest_subspace}},
$data_tag;
}
print "Finished calculating the eigenvectors for the clusters produced by the previous\n" .
"iteration and re-assigning the data samples to the new subspaces on the basis of\n".
"the least reconstruction error.\n\n" .
"Total reconstruction error in this iteration: $total_reconstruction_error_this_iteration\n"
if $self->{_terminal_output};
foreach my $i (0..$self->{_KM}-1) {
$clusters[$i] = $best_subspace_based_partition_of_data{$i};
}
display_clusters(\@clusters) if $self->{_terminal_output};
# Check if any cluster has lost all its elements. If so, fragment the worst
# existing cluster to create the additional clusters needed:
if (any {@$_ == 0} @clusters) {
die "empty cluster found" if $self->{_auto_retry_flag};
print "\nOne or more clusters have become empty. Will carve out the needed clusters\n" .
"from the cluster with the largest subspace construction error.\n\n";
$total_reconstruction_error_this_iteration = 0;
@subspace_construction_errors_this_iteration = ();
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
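# The absolute sum of the trailing eigenvalues serves as the construction error for
# the subspace fitted to this cluster; the smaller it is, the better the cluster is
# approximated by its hyperplane: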
my $subspace_construction_error = reduce {abs($a) + abs($b)} @trailing_eigenvals;
push @subspace_construction_errors_this_iteration, $subspace_construction_error;
my $trailing_eigenvec_matrix = Math::GSL::Matrix->new($self->{_data_dimensions},
scalar(@trailing_eigenvecs));
foreach my $j (0..@trailing_eigenvecs-1) {
$trailing_eigenvec_matrix->set_col($j, $trailing_eigenvecs[$j]);
}
push @trailing_eigenvec_matrices_for_all_subspaces,$trailing_eigenvec_matrix;
push @reference_vecs_for_all_subspaces, $mean;
}
my %best_subspace_based_partition_of_data;
foreach my $i (0..$self->{_KM}-1) {
$best_subspace_based_partition_of_data{$i} = [];
}
foreach my $data_tag (@{$self->{_data_tags}}) {
my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
$data_vec->set_col(0, $self->{_data_hash}->{$data_tag});
my @errors = map {$self->reconstruction_error($data_vec,
$trailing_eigenvec_matrices_for_all_subspaces[$_],
$reference_vecs_for_all_subspaces[$_])}
0 .. $self->{_KM}-1;
my ($minval, $index_for_closest_subspace) = minimum(\@errors);
$total_reconstruction_error_this_iteration += $minval;
push @{$best_subspace_based_partition_of_data{$index_for_closest_subspace}},
$data_tag;
}
print "empty-cluster jag: total reconstruction error in this iteration: \n" .
"$total_reconstruction_error_this_iteration\n"
if $self->{_debug};
foreach my $i (0..$self->{_KM}-1) {
$clusters[$i] = $best_subspace_based_partition_of_data{$i};
}
display_clusters(\@clusters) if $self->{_terminal_output};
@clusters = grep {@$_ != 0} @clusters;
die "linear manifold based algorithm does not appear to work in this case $!"
unless @clusters == $self->{_KM};
}# end of foreach my $cluster (@clusters) ... loop followed by if clause for empty clusters
if ($self->{_data_dimensions} == 3) {
$visualization_msg = "clustering_at_iteration_$iteration_index";
$self->visualize_clusters_on_sphere($visualization_msg, \@clusters)
if $self->{_visualize_each_iteration};
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
$self->{_final_clusters} = \@final_clusters;
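# Identify the data tags that did not make it into any of the final clusters: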
my %data_clustered = map {$_ => 1} @data_clustered;
my @data_tags_not_clustered =
grep {$_} map {exists $data_clustered{$_} ? undef : $_} @{$self->{_data_tags}};
if ($self->{_terminal_output}) {
print "\n\nNot all data clustered. The most reliable clusters found by graph partitioning:\n";
display_clusters(\@final_clusters);
print "\n\nData not yet clustered:\n\n@data_tags_not_clustered\n";
}
if ($self->{_data_dimensions} == 3) {
$visualization_msg = "$self->{_K}_best_clusters_produced_by_graph_partitioning";
$self->visualize_clusters_on_sphere($visualization_msg, \@final_clusters)
if $self->{_visualize_each_iteration};
$self->visualize_clusters_on_sphere($visualization_msg, \@final_clusters, "png")
if $self->{_make_png_for_each_iteration};
}
my %data_tags_to_cluster_label_hash;
foreach my $i (0..@final_clusters-1) {
map {$data_tags_to_cluster_label_hash{$_} = $i} @{$final_clusters[$i]};
}
$self->{_data_tags_to_cluster_label_hash} = \%data_tags_to_cluster_label_hash;
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
foreach my $i (0..@sorted_eigenvec_indexes-1) {
$sorted_eigenvecs[$i] = $all_my_eigenvecs[$sorted_eigenvec_indexes[$i]];
$sorted_eigenvals[$i] = $eigenvalues->[$sorted_eigenvec_indexes[$i]];
}
if ($self->{_debug2}) {
print "\nHere come sorted eigenvectors for A --- from the largest to the smallest:\n";
foreach my $i (0..@sorted_eigenvecs-1) {
print "eigenvec: @{$sorted_eigenvecs[$i]} eigenvalue: $sorted_eigenvals[$i]\n";
}
}
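# Since the eigenvectors are sorted from the largest eigenvalue to the smallest, the
# next line picks the eigenvector associated with the second-smallest eigenvalue, in
# the spirit of the Fiedler vector used in spectral graph partitioning: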
my $best_partitioning_eigenvec = $sorted_eigenvecs[@sorted_eigenvec_indexes-2];
print "\nBest graph partitioning eigenvector: @$best_partitioning_eigenvec\n" if $self->{_terminal_output};
my $how_many_positive = reduce {$a + $b} map {$_ > 0 ? 1 : 0 } @$best_partitioning_eigenvec;
my $how_many_negative = scalar(@$best_partitioning_eigenvec) - $how_many_positive;
print "Have $how_many_positive positive and $how_many_negative negative elements in the partitioning vec\n"
if $self->{_terminal_output};
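# The signs of the components of the partitioning eigenvector are now used to split
# the clusters into two groups: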
if ($how_many_clusters_looking_for <= 3) {
my @merged_cluster;
my $final_cluster;
my @newclusters;
if ($how_many_positive == 1) {
foreach my $i (0..@$clusters-1) {
if ($best_partitioning_eigenvec->[$i] > 0) {
$final_cluster = $clusters->[$i];
} else {
push @newclusters, $clusters->[$i];
}
}
return ([$final_cluster], \@newclusters);
} elsif ($how_many_negative == 1) {
foreach my $i (0..@$clusters-1) {
if ($best_partitioning_eigenvec->[$i] < 0) {
$final_cluster = $clusters->[$i];
} else {
push @newclusters, $clusters->[$i];
}
}
return ([$final_cluster], \@newclusters);
} elsif ($how_many_positive <= $self->{_cluster_search_multiplier}) {
foreach my $i (0..@$clusters-1) {
if ($best_partitioning_eigenvec->[$i] > 0) {
push @merged_cluster, @{$clusters->[$i]};
} else {
push @newclusters, $clusters->[$i];
}
}
return ([\@merged_cluster], \@newclusters);
} elsif ($how_many_negative <= $self->{_cluster_search_multiplier}) {
foreach my $i (0..@$clusters-1) {
if ($best_partitioning_eigenvec->[$i] < 0) {
push @merged_cluster, @{$clusters->[$i]};
} else {
push @newclusters, $clusters->[$i];
}
}
return ([\@merged_cluster], \@newclusters);
} else {
die "\n\nBailing out!\n\n" .
"No consensus support for dominant clusters in the graph partitioning step\n" .
"of the algorithm. This can be caused by bad random selection of initial\n" .
"cluster centers. Please run this program again.\n";
}
} else {
my @positive_clusters;
my @negative_clusters;
foreach my $i (0..@$clusters-1) {
if ($best_partitioning_eigenvec->[$i] > 0) {
push @positive_clusters, $clusters->[$i];
} else {
push @negative_clusters, $clusters->[$i];
}
}
return (\@positive_clusters, \@negative_clusters);
}
}
sub pairwise_cluster_similarity {
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
} else {
%intersection_hash = map {$_ => 1} grep {$tag_hash_for_this_dimen{$_}}
keys %intersection_hash;
}
}
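# The unimodality index for the cluster is the fraction of its members that survive
# the intersection across all of the dimensions: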
my @intersection_set = keys %intersection_hash;
my $cluster_unimodality_index = scalar(@intersection_set) / scalar(@$cluster);
return $cluster_unimodality_index;
}
sub find_best_ref_vector {
my $self = shift;
my $cluster = shift;
my $trailing_eigenvec_matrix = shift;
my $mean = shift; # a GSL matrix ref
my @min_bounds;
my @max_bounds;
my @ranges;
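# Find the smallest and the largest coordinate values of the cluster's data along
# each dimension: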
foreach my $dimen (0..$self->{_data_dimensions}-1) {
my @values = map {$_->[$dimen]} map {$self->{_data_hash}->{$_}} @$cluster;
my ($min, $max) = (List::Util::min(@values), List::Util::max(@values));
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
}
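# Accept the candidate cluster-center indices only when they are spread over more
# than 30% of the data store; otherwise try again: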
my ($min, $max) = minmax(\@cluster_center_indices);
my $quality = ($max - $min) / $data_store_size;
last if $quality > 0.3;
}
return @cluster_center_indices;
}
# The purpose of this routine is to form initial clusters by assigning the data
# samples to the initial clusters formed by the previous routine on the basis of the
# best proximity of the data samples to the different cluster centers.
sub assign_data_to_clusters_initial {
my $self = shift;
my @cluster_centers = @{ shift @_ };
my @clusters;
foreach my $ele (@{$self->{_data_tags}}) {
my $best_cluster;
my @dist_from_clust_centers;
foreach my $center (@cluster_centers) {
push @dist_from_clust_centers, $self->distance($ele, $center);
}
my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
push @{$clusters[$best_center_index]}, $ele;
}
return \@clusters;
}
# The following routine is for computing the distance between a data point specified
# by its symbolic name in the master datafile and a point (such as the center of a
# cluster) expressed as a vector of coordinates:
sub distance {
my $self = shift;
my $ele1_id = shift @_; # symbolic name of data sample
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
hyperplane through the surface on which the data actually resides. As to why that
might be a bad thing to do, just imagine the consequences of assuming that your data
falls on a straight line when, in reality, it falls on a strongly curving arc. This
is exactly what happens with PCA --- it gives you a linear manifold approximation to
your data that may actually reside on a curved surface.
That brings us to the purpose of this module, which is to cluster data that resides
on a nonlinear manifold. Since a nonlinear manifold is locally linear, we can think
of each data cluster on a nonlinear manifold as falling on a locally linear portion
of the manifold, meaning on a hyperplane. The logic of the module is based on
finding a set of hyperplanes that best describes the data, with each hyperplane
derived from a local data cluster. This is like constructing a piecewise linear
approximation to data that falls on a curve as opposed to constructing a single
straight line approximation to all of the data. So whereas the frequently used PCA
algorithm gives you a single hyperplane approximation to all your data, what this
module returns is a set of hyperplane approximations, with each hyperplane derived by
applying the PCA algorithm locally to a data cluster.
That brings us to the problem of how to actually discover the best set of hyperplane
approximations to the data. What is probably the most popular algorithm today for
that purpose is based on the following key idea: Given a set of subspaces to which a
data element can be assigned, you assign it to that subspace for which the
B<reconstruction error> is the least. But what do we mean by a B<subspace> and what
is B<reconstruction error>?
To understand the notions of B<subspace> and B<reconstruction-error>, let's revisit
the traditional approach of dimensionality reduction by the PCA algorithm. The PCA
algorithm consists of: (1) Subtracting from each data element the global mean of the
data; (2) Calculating the covariance matrix of the data; (3) Carrying out an
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
I<In linear-manifold based machine learning, instead of constructing a single
subspace in the manner described above, we construct a set of subspaces, one for each
data cluster on the nonlinear manifold. After the subspaces have been constructed, a
data element is assigned to that subspace for which the reconstruction error is the
least.> On the face of it, this sounds like a chicken-and-egg sort of a problem. You
need to have already clustered the data in order to construct the subspaces at
different places on the manifold so that you can figure out which cluster to place a
data element in.
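To make the notion of reconstruction error concrete, here is a small self-contained
sketch. It is not the module's actual C<reconstruction_error()> code, which operates
on C<Math::GSL::Matrix> objects; it merely assumes that the error of a data point
with respect to a subspace is the norm of the component of the mean-subtracted point
along the subspace's trailing eigenvectors:

    use strict;
    use warnings;

    # Conceptual sketch only: project (x - mean) onto each trailing
    # eigenvector and take the norm of that projection as the error.
    sub reconstruction_error_sketch {
        my ($x, $mean, $trailing_eigenvecs) = @_;       # plain array refs
        my @diff = map { $x->[$_] - $mean->[$_] } 0 .. $#$x;
        my $error_squared = 0;
        foreach my $evec (@$trailing_eigenvecs) {
            my $projection = 0;
            $projection += $evec->[$_] * $diff[$_] for 0 .. $#diff;
            $error_squared += $projection ** 2;
        }
        return sqrt($error_squared);
    }

    # A point lying close to the x-y plane is reconstructed well by a
    # subspace whose only trailing direction is the z-axis:
    my $err = reconstruction_error_sketch([1.0, 2.0, 0.1],    # data point
                                          [0.0, 0.0, 0.0],    # subspace mean
                                          [[0.0, 0.0, 1.0]]); # trailing eigenvector
    print "reconstruction error: $err\n";                     # prints 0.1

A data point is then placed in whichever cluster's subspace yields the smallest such
error.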
Such problems, when they do possess a solution, are best tackled through iterative
algorithms in which you start with a guess for the final solution, you rearrange the
measured data on the basis of the guess, and you then use the new arrangement of the
data to refine the guess. Subsequently, you iterate through the second and the third
steps until you do not see any discernible changes in the new arrangements of the
data. This forms the basis of the clustering algorithm that is described under
B<Phase 1> in the section that follows. This algorithm was first proposed in the
article "Dimension Reduction by Local Principal Component Analysis" by Kambhatla and
Leen that appeared in the journal Neural Computation in 1997.
Unfortunately, experiments show that the algorithm as proposed by Kambhatla and Leen
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
C<delta_reconstruction_error>.
=item C<delta_reconstruction_error>:
It is this parameter that determines when the iterations will actually terminate in
Phase 1 of the algorithm. When the difference in the total reconstruction error from
one iteration to the next is less than the value given to this parameter, the
iterations come to an end. B<IMPORTANT: I have noticed that the larger the number of
data samples that need to be clustered, the larger must be the value given to this
parameter. That makes intuitive sense since the total reconstruction error is the
sum of all such errors for all of the data elements.> Unfortunately, the best value
for this parameter does NOT appear to depend linearly on the total number of data
records to be clustered.
=item C<terminal_output>:
When this parameter is set, you will see in your terminal window the different
clusters as lists of the symbolic tags associated with the data records. You will
also see in your terminal window the output produced by the different steps of the
graph partitioning algorithm as smaller clusters are merged to form the final C<K>
clusters --- assuming that you set the parameter C<cluster_search_multiplier> to an
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
The code in C<example4.pl> is special because it shows how you can call the
C<auto_retry_clusterer()> method of the module for automatic repeated invocations of
the clustering program until success is achieved. The value of the constructor
parameter C<cluster_search_multiplier> is set to 1 in C<example4.pl>, implying that
when you execute C<example4.pl> you will not be invoking Phase 2 of the algorithm.
You might wish to change the value given to the parameter
C<cluster_search_multiplier> to see how it affects the number of attempts needed to
achieve success.
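Here, as a convenience, is a minimal sketch of what a script along the lines of
C<example4.pl> might look like. The constructor parameters that are not discussed in
this excerpt (C<datafile>, C<mask>, C<K>, C<P>, C<max_iterations>) and the call to
C<get_data_from_csv()> are assumed to follow the module's synopsis; consult the
module's API documentation for the authoritative list:

    use strict;
    use warnings;
    use Algorithm::LinearManifoldDataClusterer;

    my $clusterer = Algorithm::LinearManifoldDataClusterer->new(
                        datafile => '3_clusters_on_a_sphere_3000_samples.csv',
                        mask     => 'N111',   # assumed: tag column followed by three data columns
                        K        => 3,        # how many clusters to look for
                        P        => 2,        # assumed: dimensionality of the manifold
                        cluster_search_multiplier  => 1,     # 1 means Phase 2 is never needed
                        max_iterations             => 15,
                        delta_reconstruction_error => 0.001,
                        terminal_output            => 1,
                    );
    $clusterer->get_data_from_csv();
    # keep re-running the clusterer until a run succeeds:
    my $clusters = $clusterer->auto_retry_clusterer();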
The C<examples> directory also includes PNG image files that show some intermediate
and the best final results that can be achieved by the first three examples, that
is, for the scripts C<example1.pl>, C<example2.pl>, and C<example3.pl>. If you are on
a Linux machine and if you have the C<ImageMagick> library installed, you can use the
C<display> command to see the results in the PNG images. The results you get with
your own runs of the three example scripts may or may not look like what you see in
the outputs shown in the PNG files depending on how the module seeds the clusters.
Your best results should look like what you see in the PNG images.
The C<examples> directory also contains the following utility scripts:
For generating the data for experiments with clustering:
generate_data_on_a_sphere.pl
For visualizing the data generated by the above script:
data_visualizer.pl
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
that are generated by the module when you run it with the constructor option
C<make_png_for_each_iteration> set to 1.
=head1 EXPORT
None by design.
=head1 CAVEATS
The performance of the algorithm depends greatly on the values you choose for the
constructor parameters. And even with the best choices for the parameters, the
algorithm is not theoretically guaranteed to return the best results.
Even after you have discovered the best choices for the constructor parameters, the
best way to use this module is to try it multiple times on any given data and accept
only those results that make the best intuitive sense.
The module was designed with the hope that it would rather fail than give you wrong
results. So if you see it failing a few times before it returns a good answer, that's
a good thing.
However, if the module fails too often or is too quick to give you wrong answers,
that means the module is not working on your data. If that happens, I'd love to hear
from you. That might indicate to me how I should enhance the power of this module
for its future versions.