Algorithm-LinearManifoldDataClusterer


lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

  #  which says that the symbolic tag is in the first column and that the numerical
  #  data in the next three columns is to be used for clustering.  If your data file
  #  had, say, five columns and you wanted only the last three columns to be
  #  clustered, the mask would become `N0111' assuming that the symbolic tag is
  #  still in the first column.

  #  Now you must construct an instance of the clusterer through a call such as:

  my $clusterer = Algorithm::LinearManifoldDataClusterer->new(
                                    datafile => $datafile,
                                    mask     => $mask,
                                    K        => 3,     
                                    P        => 2,     
                                    max_iterations => 15,
                                    cluster_search_multiplier => 2,
                                    delta_reconstruction_error => 0.001,
                                    terminal_output => 1,
                                    visualize_each_iteration => 1,
                                    show_hidden_in_3D_plots => 1,
                                    make_png_for_each_iteration => 1,
                  );

  #  where the parameter K specifies the number of clusters you expect to find in
  #  your data and the parameter P is the dimensionality of the manifold on which the
  #  data resides.  The parameter cluster_search_multiplier is for increasing the
  #  odds that the random seeds chosen initially for clustering will populate all the
  #  clusters.  Set this parameter to a low number like 2 or 3. The parameter
  #  max_iterations places a hard limit on the number of iterations that the
  #  algorithm is allowed.  The actual number of iterations is controlled by the
  #  parameter delta_reconstruction_error.  The iterations stop when the change in
  #  the total "reconstruction error" from one iteration to the next is smaller than
  #  the value specified by delta_reconstruction_error.

  #  Next, you must get the module to read the data for clustering:

  $clusterer->get_data_from_csv();

  #  Finally, you invoke linear manifold clustering by:

  my $clusters = $clusterer->linear_manifold_clusterer();

  #  The value returned by this call is a reference to an array of anonymous arrays,
  #  with each anonymous array holding one cluster.  If you wish, you can have the
  #  module write the clusters to individual files by the following call:

  $clusterer->write_clusters_to_files($clusters);

  #  If you want to see how the reconstruction error changes with the iterations, you
  #  can make the call:

  $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

  #  When your data is 3-dimensional and when the clusters reside on a surface that
  #  is more or less spherical, you can visualize the clusters by calling

  $clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

  #  where the first argument is a label to be displayed in the 3D plot and the
  #  second argument is the value returned by calling linear_manifold_clusterer().

  #  SYNTHETIC DATA GENERATION:

  #  The module includes an embedded class, DataGenerator, for generating synthetic
  #  three-dimensional data that can be used to experiment with the clustering code.
  #  The synthetic data, written out to a CSV file, consists of Gaussian clusters on
  #  the surface of a sphere.  You can control the number of clusters, the width of
  #  each cluster, and the number of samples in the clusters by giving appropriate
  #  values to the constructor parameters as shown below:

  use strict;
  use Algorithm::LinearManifoldDataClusterer;

  my $output_file = "4_clusters_on_a_sphere_1000_samples.csv";

  my $training_data_gen = DataGenerator->new(
                             output_file => $output_file,
                             cluster_width => 0.015,
                             total_number_of_samples_needed => 1000,
                             number_of_clusters_on_sphere => 4,
                             show_hidden_in_3D_plots => 0,
                          );
  $training_data_gen->gen_data_and_write_to_csv();
  $training_data_gen->visualize_data_on_sphere($output_file);
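
The column mask described at the top of this synopsis can be illustrated with a small sketch.  It assumes, based only on the `N0111' example, that `N' marks the symbolic-tag column, `1' marks a column to use, and `0' marks a column to skip.  The helper `apply_mask` and the sample record are hypothetical and are not part of the module:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: apply a mask such as "N0111"
# to one CSV record and return the symbolic tag plus the selected values.
sub apply_mask {
    my ($mask, $record) = @_;
    my @flags  = split //, $mask;
    my @fields = split /,/, $record;
    die "mask and record differ in length\n" unless @flags == @fields;
    my ($tag, @values);
    for my $i (0 .. $#flags) {
        if    ($flags[$i] eq 'N') { $tag = $fields[$i] }
        elsif ($flags[$i] eq '1') { push @values, $fields[$i] }
        # a '0' flag means the column is skipped
    }
    return ($tag, @values);
}

# Made-up record with five columns: tag, one ignored value, three used values.
my ($tag, @values) = apply_mask('N0111', 'sample_7,9.9,0.12,0.45,0.88');
print "$tag => @values\n";
```

With this mask, only the last three numeric columns of the record would reach the clusterer.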


=head1 CHANGES

Version 1.01: Typos and other errors removed in the documentation.  The
documentation now also includes a link to a tutorial on data processing on manifolds.


=head1 DESCRIPTION

If you are new to machine learning and data clustering on linear and nonlinear
manifolds, your first question is likely to be: What is a manifold?  A manifold is a
space that is locally Euclidean. And a space is locally Euclidean if it allows for
the points in a small neighborhood to be represented by, say, the Cartesian
coordinates and if the distances between the points in the neighborhood are given by
the Euclidean metric.  For example, the set of all points on the surface of a
sphere does NOT constitute a Euclidean space.  Nonetheless, if you confined your
attention to a small enough neighborhood around a point, the space would seem to be
locally Euclidean.  The surface of a sphere is a 2-dimensional manifold embedded in a
3-dimensional space.  A plane in a 3-dimensional space is also a 2-dimensional
manifold. You would think of the surface of a sphere as a nonlinear manifold, whereas
a plane would be a linear manifold.  However, note that any nonlinear manifold is
locally a linear manifold.  That is, given a sufficiently small neighborhood on a
nonlinear manifold, you can always think of it as a locally flat surface.
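
A small numerical sketch may make the "locally flat" claim concrete.  On a unit sphere, two points separated by an angle theta are theta apart along the surface but 2*sin(theta/2) apart along the straight chord.  The helper name `chord_to_arc_ratio` is made up for this illustration:

```perl
use strict;
use warnings;

# Ratio of the straight-line (Euclidean) distance to the along-the-surface
# distance for two points on a unit sphere separated by $theta radians.
sub chord_to_arc_ratio {
    my ($theta) = @_;
    return 2 * sin($theta / 2) / $theta;
}

# As the neighborhood shrinks, the two distances agree more and more closely.
for my $theta (1.0, 0.1, 0.01) {
    printf "theta = %-5s chord/arc = %.6f\n", $theta, chord_to_arc_ratio($theta);
}
```

The ratio approaches 1 as theta shrinks, which is the sense in which a small enough patch of the sphere behaves like a flat, Euclidean surface.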

As to why we need machine learning and data clustering on manifolds, there exist many
important applications in which the measured data resides on a nonlinear manifold.
For example, when you record images of a human face from different angles, all the
image pixels taken together fall on a low-dimensional surface in a high-dimensional
measurement space. The same is believed to be true for the satellite images of a land
mass that are recorded with the sun at different angles with respect to the direction
of the camera.

Reducing the dimensionality of the sort of data mentioned above is critical to the
proper functioning of downstream classification algorithms, and the most popular
traditional method for dimensionality reduction is the Principal Components Analysis
(PCA) algorithm.  However, using PCA is tantamount to passing a linear least-squares
hyperplane through the surface on which the data actually resides.  As to why that

lib/Algorithm/LinearManifoldDataClusterer.pm  view on Meta::CPAN

This is the main call to the linear-manifold based clusterer.  The first call works
by side-effect, meaning that you will see the clusters in your terminal window and
they would be written out to disk files (depending on the constructor options you
have set).  The second call also returns the clusters as a reference to an array of
anonymous arrays, each holding the symbolic tags for a cluster.
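
As a sketch of how the returned structure might be consumed, the following walks a made-up value with the shape described above.  The tags and the helper `cluster_sizes` are illustrative assumptions, not output from the module:

```perl
use strict;
use warnings;

# Made-up value with the described shape: a reference to an array of
# anonymous arrays, each holding the symbolic tags of one cluster.
my $clusters = [
    [ 'a_1', 'a_2', 'a_3' ],
    [ 'b_1', 'b_2' ],
];

# Hypothetical helper: number of tags in each cluster.
sub cluster_sizes {
    my ($clusters_ref) = @_;
    return map { scalar @$_ } @$clusters_ref;
}

for my $i (0 .. $#{$clusters}) {
    print "cluster $i: @{ $clusters->[$i] }\n";
}
my @sizes = cluster_sizes($clusters);
print "sizes: @sizes\n";
```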

=item B<display_reconstruction_errors_as_a_function_of_iterations()>:

    $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

This method would normally be called after the clustering is completed to see how the
reconstruction errors decreased with the iterations in Phase 1 of the overall
algorithm.
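
The stopping rule governed by the constructor parameters C<delta_reconstruction_error> and C<max_iterations> can be sketched as follows.  This is only an illustration of the rule as it is described in this documentation, not the module's internal code.  The helper `iterations_run` and the error values are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: given the total reconstruction error observed after
# each iteration, return how many iterations would run before stopping.
sub iterations_run {
    my ($errors, $delta, $max_iterations) = @_;
    my $previous;
    my $count = 0;
    for my $error (@$errors) {
        last if $count >= $max_iterations;    # hard limit on iterations
        $count++;
        # Stop once the change from the previous iteration drops below delta.
        last if defined $previous && abs($previous - $error) < $delta;
        $previous = $error;
    }
    return $count;
}

my @errors = (12.0, 4.0, 1.5, 1.4996, 1.4995);    # made-up error values
print iterations_run(\@errors, 0.001, 15), "\n";
```

Here the fourth iteration changes the error by less than 0.001, so the loop stops well before the limit of 15.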

=item B<write_clusters_to_files()>:

    $clusterer->write_clusters_to_files($clusters);

As its name implies, when you call this method, the final clusters are written out
to disk files.  The files have names like:

     cluster0.txt 
     cluster1.txt 
     cluster2.txt
     ...
     ...

Before the clusters are written to these files, the module deletes all existing
files with such names in the directory in which you call the module.
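
The delete-then-write behavior can be sketched with core modules only.  The function `write_clusters_sketch` is a hypothetical stand-in, and the one-tag-per-line file layout is an assumption, since the actual file format is not stated here.  The sketch works in a temporary directory so that it never touches real files:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical stand-in for the described behavior, not the module's code.
sub write_clusters_sketch {
    my ($dir, $clusters) = @_;
    # Remove stale cluster files first.
    opendir my $dh, $dir or die "Cannot read $dir: $!";
    my @stale = grep { /^cluster\d+\.txt$/ } readdir $dh;
    closedir $dh;
    unlink map { File::Spec->catfile($dir, $_) } @stale;
    # Write one file per cluster (assumed layout: one tag per line).
    my @written;
    for my $i (0 .. $#{$clusters}) {
        my $path = File::Spec->catfile($dir, "cluster$i.txt");
        open my $fh, '>', $path or die "Cannot open $path: $!";
        print {$fh} "$_\n" for @{ $clusters->[$i] };
        close $fh or die "Cannot close $path: $!";
        push @written, $path;
    }
    return @written;
}

my $dir   = tempdir(CLEANUP => 1);
my @files = write_clusters_sketch($dir, [ [qw(a_1 a_2)], [qw(b_1)] ]);
print "$_\n" for @files;
```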

=item B<visualize_clusters_on_sphere()>:

    $clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

or

    $clusterer->visualize_clusters_on_sphere("final_clustering", $clusters, "png");

If your data is 3-dimensional and it resides on the surface of a sphere (or in the
vicinity of such a surface), you may be able to use this method for the
visualization of the clusters produced by the algorithm.  The first invocation
produces a Gnuplot in a terminal window that you can rotate with your mouse pointer.
The second invocation produces a `.png' image of the plot.

=item B<auto_retry_clusterer()>:

    $clusterer->auto_retry_clusterer();

or

    my $clusters = $clusterer->auto_retry_clusterer();

As mentioned earlier, the module is programmed in such a way that it is more likely
to fail than to give you a wrong answer.  If manually trying the clusterer repeatedly
on a data file is frustrating, you can use C<auto_retry_clusterer()> to automatically
make repeated attempts for you.  See the script C<example4.pl> for how you can use
C<auto_retry_clusterer()> in your own code.
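
The general idea of retrying a run that may fail can be sketched as below.  This is not how C<auto_retry_clusterer()> is implemented.  The wrapper `retry_until_success` is hypothetical, and it assumes that a failed attempt raises an exception, which this documentation does not confirm:

```perl
use strict;
use warnings;

# Hypothetical retry wrapper: $attempt is any code ref that dies on failure
# and returns a defined value (for example, the clusters) on success.
sub retry_until_success {
    my ($attempt, $max_tries) = @_;
    for my $try (1 .. $max_tries) {
        my $result = eval { $attempt->() };
        return ($result, $try) if defined $result;
        warn "attempt $try failed: $@";
    }
    die "no success after $max_tries attempts\n";
}

# Stand-in for a clustering run that fails twice before succeeding.
my $calls = 0;
my $flaky = sub {
    die "clustering failed\n" if ++$calls < 3;
    return [ ['a_1'], ['b_1'] ];
};

my ($retry_clusters, $tries) = retry_until_success($flaky, 5);
print "succeeded on attempt $tries\n";
```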

=back

=head1 GENERATING SYNTHETIC DATA FOR EXPERIMENTING WITH THE CLUSTERER

The module file also contains a class named C<DataGenerator> for generating synthetic
data for experimenting with linear-manifold based clustering.  At this time, only
3-dimensional data that resides in the form of Gaussian clusters on the surface of a
sphere is generated.  The generated data is placed in a CSV file.  You construct an
instance of the C<DataGenerator> class by a call like:

=over 4

=item B<new()>:

    my $training_data_gen = DataGenerator->new(
                                 output_file => $output_file,
                                 cluster_width => 0.0005,
                                 total_number_of_samples_needed => 1000,
                                 number_of_clusters_on_sphere => 4,
                                 show_hidden_in_3D_plots => 0,
                            );

=back

=head2 Parameters for the DataGenerator constructor:

=over 8

=item C<output_file>:

The name of the CSV file into which the generated data is written.  The numeric
values are generated using a bivariate Gaussian distribution whose two independent
variables are the azimuth and the elevation angles on the surface of a unit sphere.
The mean of each cluster is chosen randomly and its covariance set in proportion to
the value supplied for the C<cluster_width> parameter.

=item C<cluster_width>:

This parameter controls the spread of each cluster on the surface of the unit sphere.
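
A rough sketch of drawing one sample in this way is shown below.  It assumes that both angles have a variance equal to C<cluster_width> and are uncorrelated.  The documentation only says the covariance is "in proportion to" that value, so the exact scaling is a guess.  The helpers `gaussian` and `sample_on_sphere` are hypothetical:

```perl
use strict;
use warnings;

# Box-Muller transform: one sample from a zero-mean, unit-variance Gaussian.
sub gaussian {
    my $u1 = 1 - rand();    # in (0, 1], which keeps log() finite
    my $u2 = rand();
    return sqrt(-2 * log($u1)) * cos(2 * 3.141592653589793 * $u2);
}

# Assumption: variance of each angle equals $cluster_width, no correlation.
sub sample_on_sphere {
    my ($mean_azimuth, $mean_elevation, $cluster_width) = @_;
    my $sigma     = sqrt($cluster_width);
    my $azimuth   = $mean_azimuth   + $sigma * gaussian();
    my $elevation = $mean_elevation + $sigma * gaussian();
    # Convert the two angles to a point on the unit sphere.
    return (
        cos($elevation) * cos($azimuth),
        cos($elevation) * sin($azimuth),
        sin($elevation),
    );
}

my @point = sample_on_sphere(0.8, 0.3, 0.015);    # made-up cluster mean
printf "%.4f,%.4f,%.4f\n", @point;
```

Because the point is built from two angles, it lies on the unit sphere by construction, whatever spread is chosen.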

=item C<total_number_of_samples_needed>:

As its name implies, this parameter specifies the total number of data samples that
will be written out to the output file, provided this number is divisible by the
number of clusters you asked for.  If the divisibility condition is not satisfied,
the number of data samples actually written out will be as close as possible to the
number you specify for this parameter, subject to the condition that an equal number
of samples is created for each cluster.
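
The arithmetic can be sketched as follows.  It assumes that the requested total is rounded down to the nearest multiple of the number of clusters, which is one plausible reading of the rule and is not confirmed by the module's code.  The helper `samples_written` is hypothetical:

```perl
use strict;
use warnings;

# Assumption: the total is rounded down to a multiple of the cluster count
# so that every cluster receives the same number of samples.
sub samples_written {
    my ($total_needed, $number_of_clusters) = @_;
    my $per_cluster = int($total_needed / $number_of_clusters);
    return ($per_cluster, $per_cluster * $number_of_clusters);
}

for my $k (4, 3) {
    my ($per_cluster, $total) = samples_written(1000, $k);
    print "$k clusters: $per_cluster per cluster, $total in total\n";
}
```

Under this assumption, a request for 1000 samples in 4 clusters yields 250 per cluster, while the same request for 3 clusters yields 333 per cluster and 999 in total.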

=item C<number_of_clusters_on_sphere>:

Again as its name implies, this parameter specifies the number of clusters that will
be produced on the surface of a unit sphere.

=item C<show_hidden_in_3D_plots>:

This parameter is important for the visualization of the clusters and it controls
whether you will see the generated data on the back side of the sphere.  If the
clusters are not too spread out, you can set this parameter to 0 and see all the
clusters at once.  However, when the clusters are spread out, it can be visually
confusing to see the data on the back side of the sphere.  Note that no matter how
you set this parameter, you can interact with the 3D plot of the data and rotate it
with your mouse pointer to see all of the data that is generated.


