Algorithm-DBSCAN

 view release on metacpan or  search on metacpan

lib/Algorithm/DBSCAN.pm  view on Meta::CPAN

=head1 NAME

Algorithm::DBSCAN - (ALFA code) Perl implementation of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm

=cut

our $VERSION = '0.07';

=head1 SYNOPSIS

This module can be used to find clusters of points in a multidimensional space. 
More information can be found on Wikipedia: L<DBSCAN|https://en.wikipedia.org/wiki/DBSCAN>

The simple usage:

    use Algorithm::DBSCAN;
    
    my $points_data_file =     
        'point_1 56.514307478581514 37.146118456702034
        point_2 34.02049221667614 46.024651786417536
        point_3 23.473087508078684 60.62328221968349
        point_4 10.418513808840482 24.59808378533684
        point_5 10.583414831970764 25.902459835735534
        point_6 9.756855426925464 24.062840099892146
        point_7 10.567067873860672 22.32511341184489
        point_8 11.070046359352189 25.91278382647844
        point_9 9.537780590838175 25.000630928726288
        point_10 10.507367338512058 27.637356924097915
        point_11 11.949089580614444 30.67843911922257
        point_12 10.373548645248105 25.699863108892945
        point_13 47.061169019689615 12.482585189174058
        point_14 47.00269836645959 12.04880276389404
        point_15 47.197663384856476 12.899232975457025
        point_16 44.3719178488551 15.41709269630616
        point_17 46.31921200316786 12.556849509965417
        point_18 44.128763621333135 14.657970021594974
        point_19 48.89953587475758 15.183892607591467
        point_20 52.15333345222132 16.354597634497154
        point_21 50.03978361242539 14.85901473647285';

    my $dataset = Algorithm::DBSCAN::Dataset->new();
    my @lines = split(/\n\s+/, $points_data_file);
    foreach my $line (@lines) {
        $dataset->AddPoint(new Algorithm::DBSCAN::Point(split(/\s+/, $line)));
    }

    my $dbscan = Algorithm::DBSCAN->new($dataset, 4 * 4, 2);

    $dbscan->FindClusters();
    $dbscan->PrintClustersShort();
    
If you have huge datasets and want to use multiple CPUs in a optimal way you can build 
the region index with an external tool (will soon be available). En axample of code that 
uses a region index would be as follow.

Given the dataset:

    point_1 56 37
    point_2 34 46
    point_3 23 60
    point_4 10 24
    point_5 10 25
    point_6 9 24
    point_7 10 22
    point_8 11 25
    point_9 9 25
    point_10 10 27
    point_11 11 30
    point_12 10 25
    point_13 47 12
    point_14 47 12
    point_15 47 12
    point_16 44 15
    point_17 46 12
    point_18 44 14
    point_19 48 15
    point_20 52 16
    point_21 50 14

The region index with $eps = 4 x 4 and $min_distance = 2 would look like this:

    0 0
    1 1
    2 2
    3 3 4 5 6 7 8 9 11
    4 3 4 5 6 7 8 9 11
    5 3 4 5 6 7 8 9 11
    10 9 10
    12 12 13 14 16 17 18 20
    11 3 4 5 6 7 8 9 11
    13 12 13 14 16 17 18 20
    14 12 13 14 16 17 18 20
    15 15 16 17
    16 12 13 14 15 16 17 18
    18 12 13 14 16 18 20
    7 3 4 5 6 7 8 9 11
    20 12 13 14 18 19 20
    6 3 4 5 6 7 8 11
    17 12 13 14 15 16 17
    19 19 20
    8 3 4 5 6 7 8 9 11
    9 3 4 5 7 8 9 10 11

To use this index you can use the following code:

    use Algorithm::DBSCAN;
    
    my $points_data_file =     
        'point_1 56 37
        point_2 34 46
        point_3 23 60
        point_4 10 24
        point_5 10 25
        point_6 9 24
        point_7 10 22
        point_8 11 25
        point_9 9 25
        point_10 10 27
        point_11 11 30
        point_12 10 25
        point_13 47 12
        point_14 47 12
        point_15 47 12
        point_16 44 15
        point_17 46 12
        point_18 44 14
        point_19 48 15
        point_20 52 16
        point_21 50 14';

    my $dataset = Algorithm::DBSCAN::Dataset->new();
    my @lines = split(/\n\s+/, $points_data_file);
    foreach my $line (@lines) {
        $dataset->AddPoint(new Algorithm::DBSCAN::Point(split(/\s+/, $line)));
    }

    my $dbscan = Algorithm::DBSCAN->new($dataset, 4 * 4, 2);

    $dbscan->UseRegionIndex(the filename containg the previous region index);
    $dbscan->FindClusters();
    $dbscan->PrintClustersShort();
  

=head1 SUBROUTINES/METHODS

=cut

=head2 new

The constructor takes 3 parameters:
    
    $dataset: The Algorithm::DBSCAN::Dataset dataset object
        
        Create the Dataset object:
            my $dataset = Algorithm::DBSCAN::Dataset->new();
        
        Add points (the first parameter is a point_id the other are point coordinates)
            $dataset->AddPoint(new Algorithm::DBSCAN::Point('point_1', 1, 2, 3, 4, 5);
            
    $eps: The epsilon parameter used for region density computation
        WARNING: This implementation uses the sqare distance between the points to avoid 
        a useless square root call. If you want to use the euclidian distance you need to 
        convert it to the right value yourself.
        
        For example for the previous point with 5 dimensions 
        $eps = $euclidian_distance * $euclidian_distance * $euclidian_distance * $euclidian_distance * $euclidian_distance; 
        
    $min_points: the minimal number of points in a region with a radius of $eps. $eps 
    and $min_points are the 2 parameters used to compute the denisty of a region. If 
    the number of points in a region with radius $eps is lower than $min_points the 
    point is considered as an outlier point that can't be included in any cluster.



( run in 3.927 seconds using v1.01-cache-2.11-cpan-56fb94df46f )