Algorithm-LinearManifoldDataClusterer

 view release on metacpan or  search on metacpan

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

    my @cluster_center_indices;
    while (1) {
        foreach my $i (0..$K-1) {
            $cluster_center_indices[$i] = int rand( $data_store_size );
            next if $i == 0;
            foreach my $j (0..$i-1) {
                while ( $cluster_center_indices[$j] == $cluster_center_indices[$i] ) {
                    my $old = $cluster_center_indices[$i];
                    $cluster_center_indices[$i] = int rand($data_store_size);
                }
            }
        }
        my ($min,$max) = minmax(\@cluster_center_indices );
        my $quality = ($max - $min) / $data_store_size;
        last if $quality > 0.3;
    }
    return @cluster_center_indices;
}

# The purpose of this routine is to form initial clusters by assigning the data
# samples to the initial clusters formed by the previous routine on the basis of the
# best proximity of the data samples to the different cluster centers.
sub assign_data_to_clusters_initial {
    my $self = shift;
    my @cluster_centers = @{ shift @_ };
    my @clusters;
    foreach my $ele (@{$self->{_data_tags}}) {
        my $best_cluster;
        my @dist_from_clust_centers;
        foreach my $center (@cluster_centers) {
            push @dist_from_clust_centers, $self->distance($ele, $center);
        }
        my ($min, $best_center_index) = minimum( \@dist_from_clust_centers );
        push @{$clusters[$best_center_index]}, $ele;
    }
    return \@clusters;
}    

# The following routine is for computing the distance between a data point specified
# by its symbolic name in the master datafile and a point (such as the center of a
# cluster) expressed as a vector of coordinates:
sub distance {
    my $self = shift;
    my $ele1_id = shift @_;            # symbolic name of data sample
    my @ele1 = @{$self->{_data_hash}->{$ele1_id}};
    my @ele2 = @{shift @_};
    die "wrong data types for distance calculation\n" if @ele1 != @ele2;
    my $how_many = @ele1;
    my $squared_sum = 0;
    foreach my $i (0..$how_many-1) {
        $squared_sum += ($ele1[$i] - $ele2[$i])**2;
    }    
    my $dist = sqrt $squared_sum;
    return $dist;
}

sub write_clusters_to_files {
    my $self = shift;
    my $clusters = shift;
    my @clusters = @$clusters;
    unlink glob "cluster*.txt";
    foreach my $i (0..@clusters-1) {
        my $filename = "cluster" . $i . ".txt";
        print "\nWriting cluster $i to file $filename\n" if $self->{_terminal_output};
        open FILEHANDLE, "| sort > $filename"  or die "Unable to open file: $!";
        foreach my $ele (@{$clusters[$i]}) {        
            print FILEHANDLE "$ele ";
        }
        close FILEHANDLE;
    }
}

sub DESTROY {
    my $filename = basename($_[0]->{_datafile});
    $filename =~ s/\.\w+$/\.txt/;
    unlink "__temp_" . $filename;
}

##################################  Visualization Code ###################################

sub add_point_coords {
    my $self = shift;
    my @arr_of_ids = @{shift @_};      # array of data element names
    my @result;
    my $data_dimensionality = $self->{_data_dimensions};
    foreach my $i (0..$data_dimensionality-1) {
        $result[$i] = 0.0;
    }
    foreach my $id (@arr_of_ids) {
        my $ele = $self->{_data_hash}->{$id};
        my $i = 0;
        foreach my $component (@$ele) {
            $result[$i] += $component;
            $i++;
        }
    }
    return \@result;
}

# This is the main module version:
sub visualize_clusters_on_sphere {
    my $self = shift;
    my $visualization_msg = shift;
    my $clusters = deep_copy_AoA(shift);
    my $hardcopy_format = shift;
    my $pause_time = shift;
    my $d = $self->{_data_dimensions};
    my $temp_file = "__temp_" . $self->{_datafile};
    $temp_file =~ s/\.\w+$/\.txt/;
    unlink $temp_file if -e $temp_file;
    open OUTPUT, ">$temp_file"
           or die "Unable to open a temp file in this directory: $!";
    my @all_tags = "A".."Z";
    my @retagged_clusters;
    foreach my $cluster (@$clusters) {
        my $label = shift @all_tags;
        my @retagged_cluster = 
           map {$_ =~ s/^(\w+?)_(\w+)/$label . "_$2 @{$self->{_data_hash}->{$_}}"/e;$_} @$cluster;
        push @retagged_clusters, \@retagged_cluster;
    }
    my %clusters;

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

Version 1.01: Typos and other errors removed in the documentation. Also included in
the documentation a link to a tutorial on data processing on manifolds.


=head1 DESCRIPTION

If you are new to machine learning and data clustering on linear and nonlinear
manifolds, your first question is likely to be: What is a manifold?  A manifold is a
space that is locally Euclidean. And a space is locally Euclidean if it allows for
the points in a small neighborhood to be represented by, say, the Cartesian
coordinates and if the distances between the points in the neighborhood are given by
the Euclidean metric.  For an example, the set of all points on the surface of a
sphere does NOT constitute a Euclidean space.  Nonetheless, if you confined your
attention to a small enough neighborhood around a point, the space would seem to be
locally Euclidean.  The surface of a sphere is a 2-dimensional manifold embedded in a
3-dimensional space.  A plane in a 3-dimensional space is also a 2-dimensional
manifold. You would think of the surface of a sphere as a nonlinear manifold, whereas
a plane would be a linear manifold.  However, note that any nonlinear manifold is
locally a linear manifold.  That is, given a sufficiently small neighborhood on a
nonlinear manifold, you can always think of it as a locally flat surface.

As to why we need machine learning and data clustering on manifolds, there exist many
important applications in which the measured data resides on a nonlinear manifold.
For example, when you record images of a human face from different angles, all the
image pixels taken together fall on a low-dimensional surface in a high-dimensional
measurement space. The same is believed to be true for the satellite images of a land
mass that are recorded with the sun at different angles with respect to the direction
of the camera.

Reducing the dimensionality of the sort of data mentioned above is critical to the
proper functioning of downstream classification algorithms, and the most popular
traditional method for dimensionality reduction is the Principal Components Analysis
(PCA) algorithm.  However, using PCA is tantamount to passing a linear least-squares
hyperplane through the surface on which the data actually resides.  As to why that
might be a bad thing to do, just imagine the consequences of assuming that your data
falls on a straight line when, in reality, it falls on a strongly curving arc.  This
is exactly what happens with PCA --- it gives you a linear manifold approximation to
your data that may actually reside on a curved surface.

That brings us to the purpose of this module, which is to cluster data that resides
on a nonlinear manifold.  Since a nonlinear manifold is locally linear, we can think
of each data cluster on a nonlinear manifold as falling on a locally linear portion
of the manifold, meaning on a hyperplane.  The logic of the module is based on
finding a set of hyperplanes that best describes the data, with each hyperplane
derived from a local data cluster.  This is like constructing a piecewise linear
approximation to data that falls on a curve as opposed to constructing a single
straight line approximation to all of the data.  So whereas the frequently used PCA
algorithm gives you a single hyperplane approximation to all your data, what this
module returns is a set of hyperplane approximations, with each hyperplane derived by
applying the PCA algorithm locally to a data cluster.

That brings us to the problem of how to actually discover the best set of hyperplane
approximations to the data.  What is probably the most popular algorithm today for
that purpose is based on the following key idea: Given a set of subspaces to which a
data element can be assigned, you assign it to that subspace for which the
B<reconstruction error> is the least.  But what do we mean by a B<subspace> and what
is B<reconstruction error>?

To understand the notions of B<subspace> and B<reconstruction-error>, let's revisit
the traditional approach of dimensionality reduction by the PCA algorithm.  The PCA
algorithm consists of: (1) Subtracting from each data element the global mean of the
data; (2) Calculating the covariance matrix of the data; (3) Carrying out an
eigendecomposition of the covariance matrix and ordering the eigenvectors according
to decreasing values of the corresponding eigenvalues; (4) Forming a B<subspace> by
discarding the trailing eigenvectors whose corresponding eigenvalues are relatively
small; and, finally, (5) projecting all the data elements into the subspace so
formed. The error incurred in representing a data element by its projection into the
subspace is known as the B<reconstruction error>.  This error is the projection of
the data element into the space spanned by the discarded trailing eigenvectors.

I<In linear-manifold based machine learning, instead of constructing a single
subspace in the manner described above, we construct a set of subspaces, one for each
data cluster on the nonlinear manifold.  After the subspaces have been constructed, a
data element is assigned to that subspace for which the reconstruction error is the
least.> On the face of it, this sounds like a chicken-and-egg sort of a problem.  You
need to have already clustered the data in order to construct the subspaces at
different places on the manifold so that you can figure out which cluster to place a
data element in.

Such problems, when they do possess a solution, are best tackled through iterative
algorithms in which you start with a guess for the final solution, you rearrange the
measured data on the basis of the guess, and you then use the new arrangement of the
data to refine the guess.  Subsequently, you iterate through the second and the third
steps until you do not see any discernible changes in the new arrangements of the
data.  This forms the basis of the clustering algorithm that is described under
B<Phase 1> in the section that follows.  This algorithm was first proposed in the
article "Dimension Reduction by Local Principal Component Analysis" by Kambhatla and
Leen that appeared in the journal Neural Computation in 1997.

Unfortunately, experiments show that the algorithm as proposed by Kambhatla and Leen
is much too sensitive to how the clusters are seeded initially.  To get around this
limitation of the basic clustering-by-minimization-of-reconstruction-error, this
module implements a two phased approach.  In B<Phase 1>, we introduce a multiplier
effect in our search for clusters by looking for C<M*K> clusters instead of the main
C<K> clusters.  In this manner, we increase the odds that each original cluster will
be visited by one or more of the C<M*K> randomly selected seeds at the beginning,
where C<M> is the integer value given to the constructor parameter
C<cluster_search_multiplier>.  Subsequently, we merge the clusters that belong
together in order to form the final C<K> clusters.  That work is done in B<Phase 2>
of the algorithm.

For the cluster merging operation in Phase 2, we model the C<M*K> clusters as the
nodes of an attributed graph in which the weight given to an edge connecting a pair
of nodes is a measure of the similarity between the two clusters corresponding to the
two nodes.  Subsequently, we use spectral clustering to merge the most similar nodes
in our quest to partition the data into C<K> clusters.  For that purpose, we use the
Shi-Malik normalized cuts algorithm.  The pairwise node similarity required by this
algorithm is measured by the C<pairwise_cluster_similarity()> method of the
C<LinearManifoldDataClusterer> class.  The smaller the overall reconstruction error
when all of the data elements in one cluster are projected into the other's subspace
and vice versa, the greater the similarity between two clusters.  Additionally, the
smaller the distance between the mean vectors of the clusters, the greater the
similarity between two clusters.  The overall similarity between a pair of clusters
is a combination of these two similarity measures.

For additional information regarding the theoretical underpinnings of the algorithm
implemented in this module, visit
L<https://engineering.purdue.edu/kak/Tutorials/ClusteringDataOnManifolds.pdf>


=head1 SUMMARY OF THE ALGORITHM



( run in 5.373 seconds using v1.01-cache-2.11-cpan-97f6503c9c8 )