Algorithm-LinearManifoldDataClusterer

 view release on metacpan or  search on metacpan

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

        $ref_vec->set_col(0, $candidate_ref_vec);
        my $reconstruction_error_for_a_ref_vec = 0;
        foreach my $data_tag (@{$self->{_data_tags}}) {
            my $data_vec = Math::GSL::Matrix->new($self->{_data_dimensions},1);
            $data_vec->set_col(0, $self->{_data_hash}->{$data_tag});
            my $error = $self->reconstruction_error($data_vec,$trailing_eigenvec_matrix,$ref_vec);
            $reconstruction_error_for_a_ref_vec += $error;
        }
        push @reconstruction_errors, $reconstruction_error_for_a_ref_vec;
    }
    my $recon_error_for_original_mean = shift @reconstruction_errors;
    my $smallest_error_randomly_selected_ref_vecs = List::Util::min(@reconstruction_errors);
    my $minindex = List::Util::first { $_ == $smallest_error_randomly_selected_ref_vecs }
                                                    @reconstruction_errors;
    my $refvec = $random_points[$minindex];
    return $refvec;
}

##  The reconstruction error relates to the size of the perpendicular from a data
##  point X to the hyperplane that defines a given subspace on the manifold.
sub reconstruction_error {

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

data.  This forms the basis of the clustering algorithm that is described under
B<Phase 1> in the section that follows.  This algorithm was first proposed in the
article "Dimension Reduction by Local Principal Component Analysis" by Kambhatla and
Leen that appeared in the journal Neural Computation in 1997.

Unfortunately, experiments show that the algorithm as proposed by Kambhatla and Leen
is much too sensitive to how the clusters are seeded initially.  To get around this
limitation of the basic clustering-by-minimization-of-reconstruction-error, this
module implements a two phased approach.  In B<Phase 1>, we introduce a multiplier
effect in our search for clusters by looking for C<M*K> clusters instead of the main
C<K> clusters.  In this manner, we increase the odds that each original cluster will
be visited by one or more of the C<M*K> randomly selected seeds at the beginning,
where C<M> is the integer value given to the constructor parameter
C<cluster_search_multiplier>.  Subsequently, we merge the clusters that belong
together in order to form the final C<K> clusters.  That work is done in B<Phase 2>
of the algorithm.

For the cluster merging operation in Phase 2, we model the C<M*K> clusters as the
nodes of an attributed graph in which the weight given to an edge connecting a pair
of nodes is a measure of the similarity between the two clusters corresponding to the
two nodes.  Subsequently, we use spectral clustering to merge the most similar nodes



( run in 0.219 second using v1.01-cache-2.11-cpan-1c8d708658b )