Algorithm-LinearManifoldDataClusterer


lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

            print "\nApplying unimodality correction:\n\n" if $self->{_terminal_output};
            my @sorted_cluster_indexes = 
               sort {$cluster_unimodality_quotients[$b] <=> $cluster_unimodality_quotients[$a]} 0..@clusters-1;
            my @newclusters;
            foreach my $cluster_index (0..@clusters - 1) {
                push @newclusters, $clusters[$sorted_cluster_indexes[$cluster_index]];
            }
            @clusters = @newclusters;
            my $worst_cluster = pop @clusters;
            print "\nthe worst cluster: @$worst_cluster\n" if $self->{_terminal_output};
            my $second_worst_cluster = pop @clusters;
            print "\nthe second worst cluster: @$second_worst_cluster\n" if $self->{_terminal_output};
            push @$worst_cluster, @$second_worst_cluster;
            fisher_yates_shuffle($worst_cluster);
            my @first_half = @$worst_cluster[0 .. int(scalar(@$worst_cluster)/2) - 1];
            my @second_half = @$worst_cluster[int(scalar(@$worst_cluster)/2) .. @$worst_cluster - 1];
            push @clusters, \@first_half;
            push @clusters, \@second_half;
            if ($self->{_terminal_output}) {
                print "\n\nShowing the clusters obtained after applying the unimodality correction:\n";
                display_clusters(\@clusters);      
            }
        }
        if (@{$self->{_reconstruction_error_as_a_function_of_iteration}} > 0) {
            my $last_recon_error = $self->{_reconstruction_error_as_a_function_of_iteration}->[-1];
            if (($last_recon_error - $total_reconstruction_error_this_iteration) 
                                                        < $self->{_delta_normalized_error}) {

        my $range = $max - $min;
        my $mean_along_this_dimen = $mean[$dimen];
        my @tags =  grep {$_} 
                    map { ( ($self->{_data_hash}->{$_}->[$dimen] > $mean_along_this_dimen - $delta * $range) 
                            && 
                            ($self->{_data_hash}->{$_}->[$dimen] < $mean_along_this_dimen + $delta * $range) ) 
                          ? $_ : undef }
                    @$cluster; 
        push @data_tags_for_range_tests, \@tags;
    }
    # Now find the intersection of the tag sets for each of the dimensions
    my %intersection_hash;
    foreach my $dimen (0..$self->{_data_dimensions}-1) {
        my %tag_hash_for_this_dimen  = map {$_ => 1} @{$data_tags_for_range_tests[$dimen]};
        if ($dimen == 0) {
            %intersection_hash = %tag_hash_for_this_dimen;
        } else {
            %intersection_hash = map {$_ => 1} grep {$tag_hash_for_this_dimen{$_}} 
                                 keys %intersection_hash;
        }
    }
    my @intersection_set = keys %intersection_hash;
    my $cluster_unimodality_index = scalar(@intersection_set) / scalar(@$cluster);
    return $cluster_unimodality_index;
}

sub find_best_ref_vector {
    my $self = shift;
    my $cluster = shift;
    my $trailing_eigenvec_matrix = shift;
    my $mean = shift;        # a GSL matrix ref
    my @min_bounds;
    my @max_bounds;

        }
        last if $found_match_flag == 0;
    }
    return $found_match_flag;
}

##  We first generate a set of points randomly on the unit sphere --- the number of
##  points being equal to the number of clusters desired.  These points will serve as
##  cluster means (or, as cluster centroids) subsequently when we ask
##  Math::Random::random_multivariate_normal($N, @m, @covar) to return $N number of
##  points on the sphere.  The second argument is the cluster mean and the third
##  argument the cluster covariance.  For the synthetic data, we set the cluster
##  covariance to a 2x2 diagonal matrix, with the (0,0) element corresponding to the
##  variance along the azimuth direction and the (1,1) element corresponding to the
##  variance along the elevation direction.
##
##  When you generate the points in the 2D spherical coordinates of
##  (azimuth,elevation), you also need `wrap-around' logic for those points yielded by
##  the multivariate-normal function whose azimuth angle is outside the interval
##  (0,360) and/or whose elevation angle is outside the interval (-90,90).
##
##  Note that the first of the two dimensions for which the multivariate-normal
##  function returns the points is for the azimuth angle and the second for the
##  elevation angle.
##
##  With regard to the relationship of the Cartesian coordinates to the spherical
##  (azimuth, elevation) coordinates, we assume that (x,y) is the horizontal plane
##  and z the vertical axis.  The elevation angle theta is measured with respect
##  to the XY-plane.  The highest point on the sphere (the Zenith) corresponds to
##  the elevation angle of +90 and the lowest point on the sphere (the Nadir) to
##  the elevation angle of -90.  The azimuth is measured with respect to the
##  X-axis and its range is from 0 to 360 degrees.  The elevation is measured
##  from the XY-plane and its range is (-90,90) degrees.
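
##  The wrap-around logic and the coordinate conventions described above can be
##  sketched as follows.  This is an illustrative Python/NumPy sketch, not the
##  module's Perl; the function names and the reflection-across-the-pole logic
##  are our own assumptions about how such a correction can be done.

```python
import numpy as np

def sample_cluster_on_sphere(mean_az_el, var_az, var_el, n, rng):
    """Draw n (azimuth, elevation) points from a Gaussian with diagonal
    covariance, then wrap them back into azimuth [0,360) / elevation [-90,90]."""
    cov = np.diag([var_az, var_el])          # (0,0)=azimuth var, (1,1)=elevation var
    pts = rng.multivariate_normal(mean_az_el, cov, n)
    az, el = pts[:, 0], pts[:, 1]
    # Elevation outside (-90,90): reflect across the pole and spin azimuth by 180.
    over = el > 90
    el[over] = 180.0 - el[over]
    az[over] += 180.0
    under = el < -90
    el[under] = -180.0 - el[under]
    az[under] += 180.0
    az %= 360.0                              # wrap azimuth into [0,360)
    return np.column_stack([az, el])

def to_cartesian(az_el):
    """Elevation measured from the XY-plane; azimuth from the X-axis."""
    az = np.radians(az_el[:, 0])
    el = np.radians(az_el[:, 1])
    return np.column_stack([np.cos(el) * np.cos(az),
                            np.cos(el) * np.sin(az),
                            np.sin(el)])
```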

  #  can make the call:

  $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

  #  When your data is 3-dimensional and when the clusters reside on a surface that
  #  is more or less spherical, you can visualize the clusters by calling

  $clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

  #  where the first argument is a label to be displayed in the 3D plot and the
  #  second argument the value returned by calling linear_manifold_clusterer().

  #  SYNTHETIC DATA GENERATION:

  #  The module includes an embedded class, DataGenerator, for generating synthetic
  #  three-dimensional data that can be used to experiment with the clustering code.
  #  The synthetic data, written out to a CSV file, consists of Gaussian clusters on
  #  the surface of a sphere.  You can control the number of clusters, the width of
  #  each cluster, and the number of samples in the clusters by giving appropriate
  #  values to the constructor parameters as shown below:

data cluster on the nonlinear manifold.  After the subspaces have been constructed, a
data element is assigned to that subspace for which the reconstruction error is the
least.> On the face of it, this sounds like a chicken-and-egg problem.  You
need to have already clustered the data in order to construct the subspaces at
different places on the manifold so that you can figure out which cluster to place a
data element in.

Such problems, when they do possess a solution, are best tackled through iterative
algorithms in which you start with a guess for the final solution, you rearrange the
measured data on the basis of the guess, and you then use the new arrangement of the
data to refine the guess.  Subsequently, you iterate through the second and the third
steps until you do not see any discernible changes in the new arrangements of the
data.  This forms the basis of the clustering algorithm that is described under
B<Phase 1> in the section that follows.  This algorithm was first proposed in the
article "Dimension Reduction by Local Principal Component Analysis" by Kambhatla and
Leen that appeared in the journal Neural Computation in 1997.
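The guess/rearrange/refine loop just described can be sketched generically as
follows.  This is a Python illustration only; the callback names are ours and do
not correspond to the module's API.

```python
def iterative_refine(data, initial_guess, rearrange, refine, tol, max_iter=100):
    """Generic guess -> rearrange -> refine loop: stop when the cost
    improvement between consecutive iterations drops below tol."""
    guess = initial_guess
    prev_cost = float("inf")
    for _ in range(max_iter):
        arrangement, cost = rearrange(data, guess)   # regroup the data per the guess
        if prev_cost - cost < tol:                   # no discernible change: stop
            break
        prev_cost = cost
        guess = refine(data, arrangement)            # refine the guess from the regrouping
    return arrangement
```

With `rearrange` assigning points to the nearest center and `refine` recomputing
cluster means, this skeleton reduces to the familiar k-means iteration.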

Unfortunately, experiments show that the algorithm as proposed by Kambhatla and Leen
is much too sensitive to how the clusters are seeded initially.  To get around this
limitation of the basic clustering-by-minimization-of-reconstruction-error, this
module implements a two-phase approach.  In B<Phase 1>, we introduce a multiplier
effect in our search for clusters by looking for C<M*K> clusters instead of the main
C<K> clusters.  In this manner, we increase the odds that each original cluster will
be visited by one or more of the C<M*K> randomly selected seeds at the beginning,

define the subspace for that cluster.  Use the space spanned by the remaining
eigenvectors --- we refer to them as the trailing eigenvectors --- for calculating
the reconstruction error.

=item Step 4:

Taking into account the mean associated with each cluster, re-cluster the entire data
set on the basis of the least reconstruction error.  That is, assign each data
element to that subspace for which it has the smallest reconstruction error.
Calculate the total reconstruction error associated with all the data elements.  (See
the definition of the reconstruction error in the C<Description> section.)

=item Step 5:

Stop iterating if the change in the total reconstruction error from the previous 
iteration to the current iteration is less than the value specified by the constructor
parameter C<delta_reconstruction_error>.  Otherwise, go back to Step 3.

=back
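
Steps 3 through 5 above can be sketched in Python as follows.  This is
illustrative only: the module itself is Perl and works with GSL matrices, and
the function names here are our own.

```python
import numpy as np

def reconstruction_error(x, mean, leading_eigvecs):
    """Energy of (x - mean) that falls outside the subspace spanned by the
    leading eigenvectors (columns of leading_eigvecs, shape D x P)."""
    d = x - mean
    proj = leading_eigvecs @ (leading_eigvecs.T @ d)  # component inside the subspace
    return np.sum((d - proj) ** 2)                    # residual (trailing) energy

def reassign(data, means, bases):
    """Step 4: give each point to the cluster whose subspace reconstructs it
    best, and accumulate the total reconstruction error (used in Step 5)."""
    clusters = [[] for _ in means]
    total = 0.0
    for x in data:
        errs = [reconstruction_error(x, m, B) for m, B in zip(means, bases)]
        k = int(np.argmin(errs))
        clusters[k].append(x)
        total += errs[k]
    return clusters, total
```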

=item B<Phase 2:>

distance between the mean vectors associated with the two clusters.

=item Step 3:

Calculate the symmetric normalized Laplacian of the similarity matrix.  We use C<A>
to denote the symmetric normalized Laplacian.

=item Step 4:

Carry out an eigendecomposition of the C<A> matrix and choose the eigenvector
corresponding to the second smallest eigenvalue for bipartitioning the graph on the
basis of the sign of the values in the eigenvector.

=item Step 5:

If the bipartition of the previous step yields a one-versus-the-rest partition,
add the `one' cluster to the output list of clusters and invoke graph
partitioning recursively on the `rest' by going back to Step 1.  On the other
hand, if the cardinality of both partitions returned by Step 4 exceeds 1, invoke
graph partitioning recursively on both partitions.  Stop when the number of
clusters in the output list equals C<K>.
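
Steps 3 and 4 of this bipartitioning can be sketched as follows.  This is a
Python/NumPy illustration (the module uses GSL in Perl); the function name is
ours.

```python
import numpy as np

def fiedler_bipartition(W):
    """Bipartition a similarity graph by the sign of the eigenvector of the
    symmetric normalized Laplacian A = I - D^{-1/2} W D^{-1/2} associated
    with the second-smallest eigenvalue."""
    d = W.sum(axis=1)                                # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)
    A = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(A)             # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                          # second-smallest eigenvalue
    left = np.flatnonzero(fiedler >= 0)
    right = np.flatnonzero(fiedler < 0)
    return left, right
```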

=item C<K>:

This parameter supplies the number of clusters you are looking for.

=item C<P>:

This parameter specifies the dimensionality of the manifold on which the data resides.

=item C<cluster_search_multiplier>:

As should be clear from the C<Summary of the Algorithm> section, this parameter plays
a very important role in the successful clustering of your data.  As explained in
C<Description>, the basic algorithm used for clustering in Phase 1 --- clustering by
the minimization of the reconstruction error --- is sensitive to the choice of the
cluster seeds that are selected randomly at the beginning of the algorithm.  Should
it happen that the seeds miss one or more of the clusters, the clustering produced is
likely to not be correct.  By giving an integer value to C<cluster_search_multiplier>
that is greater than 1, you'll increase the odds that the randomly selected seeds
will see all clusters.  When you set C<cluster_search_multiplier> to C<M>, you ask
Phase 1 of the algorithm to construct C<M*K> clusters as opposed to just C<K>
clusters. Subsequently, in Phase 2, the module uses inter-cluster similarity based


    $clusterer->linear_manifold_clusterer();   

    or 

    my $clusters = $clusterer->linear_manifold_clusterer();

This is the main call to the linear-manifold based clusterer.  The first call works
by side-effect, meaning that you will see the clusters in your terminal window and
they will be written out to disk files (depending on the constructor options you
have set).  The second call also returns the clusters as a reference to an array of
anonymous arrays, each holding the symbolic tags for a cluster.

=item B<display_reconstruction_errors_as_a_function_of_iterations()>:

    $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

This method would normally be called after the clustering is completed to see how the
reconstruction errors decreased with the iterations in Phase 1 of the overall
algorithm.

    $clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

or

    $clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

If your data is 3-dimensional and it resides on the surface of a sphere (or in the
vicinity of such a surface), you may be able to use these methods for the
visualization of the clusters produced by the algorithm.  The first invocation
produces a Gnuplot display in a terminal window that you can rotate with your mouse pointer.
The second invocation produces a `.png' image of the plot.

=item B<auto_retry_clusterer()>:

    $clusterer->auto_retry_clusterer();

or

    my $clusters = $clusterer->auto_retry_clusterer();

As mentioned earlier, the module is programmed in such a way that it is more likely


