Algorithm-LinearManifoldDataClusterer
view release on metacpan or search on metacpan
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
}
}
}
sub linear_manifold_clusterer {
my $self = shift;
my $KM = $self->{_KM};
my @initial_cluster_center_tags;
my $visualization_msg;
my @initial_cluster_center_indexes = $self->initialize_cluster_centers($KM, $self->{_N});
print "Randomly selected indexes for cluster center tags: @initial_cluster_center_indexes\n"
if $self->{_debug};
@initial_cluster_center_tags = map {$self->{_data_tags}->[$_]} @initial_cluster_center_indexes;
my @initial_cluster_center_coords = map {$self->{_data_hash}->{$_}} @initial_cluster_center_tags;
if ($self->{_debug}) {
foreach my $centroid (@initial_cluster_center_coords) {
print "Initial cluster center coords: @{$centroid}\n";
}
}
my $initial_clusters = $self->assign_data_to_clusters_initial(\@initial_cluster_center_coords);
if ($self->{_data_dimensions} == 3) {
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
if ($best_partitioning_eigenvec->[$i] < 0) {
push @merged_cluster, @{$clusters->[$i]};
} else {
push @newclusters, $clusters->[$i];
}
}
return ([\@merged_cluster], \@newclusters);
} else {
die "\n\nBailing out!\n\n" .
"No consensus support for dominant clusters in the graph partitioning step\n" .
"of the algorithm. This can be caused by bad random selection of initial\n" .
"cluster centers. Please run this program again.\n";
}
} else {
my @positive_clusters;
my @negative_clusters;
foreach my $i (0..@$clusters-1) {
if ($best_partitioning_eigenvec->[$i] > 0) {
push @positive_clusters, $clusters->[$i];
} else {
push @negative_clusters, $clusters->[$i];
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
my $reconstruction_error_for_a_ref_vec = 0;
foreach my $data_tag (@{$self->{_data_tags}}) {
my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
$data_vec->set_col(0, $self->{_data_hash}->{$data_tag});
my $error = $self->reconstruction_error($data_vec,$trailing_eigenvec_matrix,$ref_vec);
$reconstruction_error_for_a_ref_vec += $error;
}
push @reconstruction_errors, $reconstruction_error_for_a_ref_vec;
}
my $recon_error_for_original_mean = shift @reconstruction_errors;
my $smallest_error_randomly_selected_ref_vecs = List::Util::min(@reconstruction_errors);
my $minindex = List::Util::first { $_ == $smallest_error_randomly_selected_ref_vecs }
@reconstruction_errors;
my $refvec = $random_points[$minindex];
return $refvec;
}
## The reconstruction error relates to the size of the perpendicular from a data
## point X to the hyperplane that defines a given subspace on the manifold.
sub reconstruction_error {
my $self = shift;
my $data_vec = shift;
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
B<Phase 1> in the section that follows. This algorithm was first proposed in the
article "Dimension Reduction by Local Principal Component Analysis" by Kambhatla and
Leen that appeared in the journal Neural Computation in 1997.
Unfortunately, experiments show that the algorithm as proposed by Kambhatla and Leen
is much too sensitive to how the clusters are seeded initially. To get around this
limitation of the basic clustering-by-minimization-of-reconstruction-error, this
module implements a two phased approach. In B<Phase 1>, we introduce a multiplier
effect in our search for clusters by looking for C<M*K> clusters instead of the main
C<K> clusters. In this manner, we increase the odds that each original cluster will
be visited by one or more of the C<M*K> randomly selected seeds at the beginning,
where C<M> is the integer value given to the constructor parameter
C<cluster_search_multiplier>. Subsequently, we merge the clusters that belong
together in order to form the final C<K> clusters. That work is done in B<Phase 2>
of the algorithm.
For the cluster merging operation in Phase 2, we model the C<M*K> clusters as the
nodes of an attributed graph in which the weight given to an edge connecting a pair
of nodes is a measure of the similarity between the two clusters corresponding to the
two nodes. Subsequently, we use spectral clustering to merge the most similar nodes
in our quest to partition the data into C<K> clusters. For that purpose, we use the
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
=item C<P>:
This parameter specifies the dimensionality of the manifold on which the data resides.
=item C<cluster_search_multiplier>:
As should be clear from the C<Summary of the Algorithm> section, this parameter plays
a very important role in the successful clustering of your data. As explained in
C<Description>, the basic algorithm used for clustering in Phase 1 --- clustering by
the minimization of the reconstruction error --- is sensitive to the choice of the
cluster seeds that are selected randomly at the beginning of the algorithm. Should
it happen that the seeds miss one or more of the clusters, the clustering produced is
likely to not be correct. By giving an integer value to C<cluster_search_multiplier>
that is greater than 1, you'll increase the odds that the randomly selected seeds
will see all clusters. When you set C<cluster_search_multiplier> to C<M>, you ask
Phase 1 of the algorithm to construct C<M*K> clusters as opposed to just C<K>
clusters. Subsequently, in Phase 2, the module uses inter-cluster similarity based
graph partitioning to merge the C<M*K> clusters into C<K> clusters.
=item C<max_iterations>:
This hard limits the number of iterations in Phase 1 of the algorithm. Ordinarily,
the iterations stop automatically when the change in the total reconstruction error
from one iteration to the next is less than the value specified by the parameter
lib/Algorithm/LinearManifoldDataClusterer.pm view on Meta::CPAN
$training_data_gen->gen_data_and_write_to_csv();
As its name implies, this method generates the data according to the parameters
specified in the DataGenerator constructor.
=item B<visualize_data_on_sphere()>:
$training_data_gen->visualize_data_on_sphere($output_file);
You can use this method to visualize the clusters produced by the data generator.
Since the clusters are located at randomly selected points on a unit sphere, by
looking at the output visually, you can quickly reject what the data generator has
produced and try again.
=back
=head1 HOW THE CLUSTERS ARE OUTPUT
When the option C<terminal_output> is set in the constructor of the
C<LinearManifoldDataClusterer> class, the clusters are displayed on the terminal
screen.
( run in 0.299 second using v1.01-cache-2.11-cpan-94b05bcf43c )