Algorithm-LinearManifoldDataClusterer


lib/Algorithm/LinearManifoldDataClusterer.pm

  #  algorithm is allowed.  The actual number of iterations is controlled by the
  #  parameter delta_reconstruction_error.  The iterations stop when the change in
  #  the total "reconstruction error" from one iteration to the next is smaller than
  #  the value specified by delta_reconstruction_error.

  #  Next, you must get the module to read the data for clustering:

  $clusterer->get_data_from_csv();

  #  Finally, you invoke linear manifold clustering by:

  my $clusters = $clusterer->linear_manifold_clusterer();

  #  The value returned by this call is a reference to an array of anonymous arrays,
  #  with each anonymous array holding one cluster.  If you wish, you can have the
  #  module write the clusters to individual files by the following call:

  $clusterer->write_clusters_to_files($clusters);

  #  If you want to see how the reconstruction error changes with the iterations, you
  #  can make the call:

  $clusterer->display_reconstruction_errors_as_a_function_of_iterations();

  #  When your data is 3-dimensional and the clusters reside on a surface that is
  #  more or less spherical, you can visualize the clusters by calling

  $clusterer->visualize_clusters_on_sphere("final clustering", $clusters);

  #  where the first argument is a label to be displayed in the 3D plot and the
  #  second argument is the value returned by calling linear_manifold_clusterer().

  #  SYNTHETIC DATA GENERATION:

  #  The module includes an embedded class, DataGenerator, for generating synthetic
  #  three-dimensional data that can be used to experiment with the clustering code.
  #  The synthetic data, written out to a CSV file, consists of Gaussian clusters on
  #  the surface of a sphere.  You can control the number of clusters, the width of
  #  each cluster, and the number of samples in the clusters by giving appropriate
  #  values to the constructor parameters as shown below:

  use strict;
  use Algorithm::LinearManifoldDataClusterer;

  my $output_file = "4_clusters_on_a_sphere_1000_samples.csv";

  my $training_data_gen = DataGenerator->new(
                             output_file => $output_file,
                             cluster_width => 0.015,
                             total_number_of_samples_needed => 1000,
                             number_of_clusters_on_sphere => 4,
                             show_hidden_in_3D_plots => 0,
                          );
  $training_data_gen->gen_data_and_write_to_csv();
  $training_data_gen->visualize_data_on_sphere($output_file);


=head1 CHANGES

Version 1.01: Fixed typos and other errors in the documentation, and added to the
documentation a link to a tutorial on data processing on manifolds.


=head1 DESCRIPTION

If you are new to machine learning and data clustering on linear and nonlinear
manifolds, your first question is likely to be: What is a manifold?  A manifold is a
space that is locally Euclidean.  A space is locally Euclidean if the points in a
small neighborhood can be represented by, say, Cartesian coordinates and if the
distances between the points in the neighborhood are given by the Euclidean metric.
For example, the set of all points on the surface of a sphere does NOT constitute a
Euclidean space.  Nonetheless, if you confined your
attention to a small enough neighborhood around a point, the space would seem to be
locally Euclidean.  The surface of a sphere is a 2-dimensional manifold embedded in a
3-dimensional space.  A plane in a 3-dimensional space is also a 2-dimensional
manifold. You would think of the surface of a sphere as a nonlinear manifold, whereas
a plane would be a linear manifold.  However, note that any nonlinear manifold is
locally a linear manifold.  That is, given a sufficiently small neighborhood on a
nonlinear manifold, you can always think of it as a locally flat surface.

As to why we need machine learning and data clustering on manifolds, there exist many
important applications in which the measured data resides on a nonlinear manifold.
For example, when you record images of a human face from different angles, all the
image pixels taken together fall on a low-dimensional surface in a high-dimensional
measurement space. The same is believed to be true for the satellite images of a land
mass that are recorded with the sun at different angles with respect to the direction
of the camera.

Reducing the dimensionality of the sort of data mentioned above is critical to the
proper functioning of downstream classification algorithms, and the most popular
traditional method for dimensionality reduction is the Principal Components Analysis
(PCA) algorithm.  However, using PCA is tantamount to passing a linear least-squares
hyperplane through the surface on which the data actually resides.  As to why that
might be a bad thing to do, just imagine the consequences of assuming that your data
falls on a straight line when, in reality, it falls on a strongly curving arc.  This
is exactly what happens with PCA --- it gives you a linear manifold approximation to
your data that may actually reside on a curved surface.

That brings us to the purpose of this module, which is to cluster data that resides
on a nonlinear manifold.  Since a nonlinear manifold is locally linear, we can think
of each data cluster on a nonlinear manifold as falling on a locally linear portion
of the manifold, meaning on a hyperplane.  The logic of the module is based on
finding a set of hyperplanes that best describes the data, with each hyperplane
derived from a local data cluster.  This is like constructing a piecewise linear
approximation to data that falls on a curve as opposed to constructing a single
straight line approximation to all of the data.  So whereas the frequently used PCA
algorithm gives you a single hyperplane approximation to all your data, what this
module returns is a set of hyperplane approximations, with each hyperplane derived by
applying the PCA algorithm locally to a data cluster.
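The gain from fitting hyperplanes locally rather than globally is easy to see
numerically.  The sketch below is in Python with NumPy, purely as an illustration of
the idea; the synthetic arc data, the four-way split into "clusters", and the helper
function are all made up for the demo and are not part of this module's Perl API.

```python
import numpy as np

def pca_reconstruction_error(points, num_components):
    """Mean squared error of reconstructing points from their
    projections onto the top num_components principal directions."""
    mean = points.mean(axis=0)
    centered = points - mean
    # Right singular vectors of the centered data are the eigenvectors
    # of the covariance matrix, largest eigenvalue first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_components]               # shape (num_components, dim)
    projected = centered @ basis.T @ basis    # orthogonal projection
    return float(np.mean(np.sum((centered - projected) ** 2, axis=1)))

# 2-D data lying on a strongly curving arc (a half circle).
angles = np.linspace(0.0, np.pi, 200)
arc = np.column_stack([np.cos(angles), np.sin(angles)])

# One straight-line (1-D PCA) approximation to all of the data ...
global_err = pca_reconstruction_error(arc, 1)

# ... versus a separate 1-D PCA approximation for each of four local
# "clusters" along the arc, each of which is nearly flat.
local_err = float(np.mean([pca_reconstruction_error(chunk, 1)
                           for chunk in np.array_split(arc, 4)]))

print(f"global 1-D PCA error:     {global_err:.4f}")
print(f"mean local 1-D PCA error: {local_err:.6f}")
```

The per-cluster errors come out orders of magnitude smaller than the single global
fit, which is exactly the piecewise-linear intuition described above.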

That brings us to the problem of how to actually discover the best set of hyperplane
approximations to the data.  What is probably the most popular algorithm today for
that purpose is based on the following key idea: Given a set of subspaces to which a
data element can be assigned, you assign it to that subspace for which the
B<reconstruction error> is the least.  But what do we mean by a B<subspace> and what
is B<reconstruction error>?

To understand the notions of B<subspace> and B<reconstruction-error>, let's revisit
the traditional approach of dimensionality reduction by the PCA algorithm.  The PCA
algorithm consists of: (1) Subtracting from each data element the global mean of the
data; (2) Calculating the covariance matrix of the data; (3) Carrying out an
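Stated with illustrative notation (these symbols are ours, not identifiers from the
module's API): if C<m> is the mean of a cluster's data and the columns of C<W> are
the top C<p> unit eigenvectors of that cluster's covariance matrix, then the
B<subspace> for that cluster is the C<p>-dimensional hyperplane through C<m> spanned
by those eigenvectors, and the B<reconstruction error> of a data point C<x> with
respect to that subspace is the squared distance between C<x> and its orthogonal
projection onto the hyperplane:

    error(x)  =  || (x - m)  -  W * transpose(W) * (x - m) ||**2

A data point is then assigned to whichever cluster's hyperplane yields the smallest
such error.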


