Geo-Index

 view release on metacpan or  search on metacpan

lib/Geo/Index.pm  view on Meta::CPAN


The point to measure distances to

This can be either a hash containing at a minimum a C<lat> and a C<lon> value 
(both in degrees) or an array giving the point.  See the B<L<Points|/POINTS>> 
section above for details.

=back

=back

=cut



# Used by Distance(...)
sub DistanceTo($$) {
	my ($self, $p1) = @_;
	
	if (ref $p1 eq 'ARRAY') {
		#. Got array; expand arguments into a full point
		$p1 = { 'lat'=>$$p1[0], 'lon'=>$$p1[1] };
	}
	
	$$p1{lat_rad} = Math::Trig::deg2rad($$p1{lat}) unless ($$p1{lat_rad});
	$$p1{lon_rad} = Math::Trig::deg2rad($$p1{lon}) unless ($$p1{lon_rad});
	
	return HaversineDistance($$p1{lat_rad}, $$p1{lon_rad});
}



#. Distance functions
#. 
#. Geo::Index uses the haversine formula to compute great circle distances 
#. between points.
#. 
#. Three versions are supported: a fallback version written in Perl (used if the 
#. C versions fail to compile) and two accelerated versions written in C, one 
#. using floats and the other using doubles.  By default the C float version is 
#. used; if it fails to compile then the Perl version is used.  Use of a specific 
#. version can also be requested with Geo::Index->SetDistanceFunctionType(...).
#. 
#. The Perl version uses doubles.  When using floats instead of doubles the loss 
#. of precision is typically under a meter (about 2 meters in the worst case).
#. #. Compared to the errors inherent to the haversine function, this loss of 
#. precision is negligable.

#. Here are the results of benchmarking the three versions on a fairly high-end 
#. workstation (higher numbers are better).  The test dataset is 1 million random
#. points and each search type was performed once for each point in random order.
#. The same points were used for each test and they were in the same order.  All
#. searches returned results as lists except for the 'all points' search which 
#. returned a list reference.  The default options (Earth, 20-level index) were 
#. used for each test.  Each version's benchmark was run 32 times; some jitter 
#. was observed.
#. 
#. Average number of operations per second using each version (rounded):
#. Percentages (in parentheses) are relative to the pure-Perl version.
#. 
#. Results for 32 iterations of each test:
#. 
#. Operation                            Perl     C double (%)    C float (%) 
#. --------------------------------    ------    ------------    ------------
#. Add point to index                   35861     36067 (101)     36060 (101)
#. Search: return all points           256127    256877 (101)    252116  (98)
#. Search: sort, max 5                   6733      8718 (129)      8860 (132)
#. Search: sort, radius 1000, max 5     45364     49063 (108)     49831 (110)
#. Search: sort, radius 1000            45902     49418 (108)     51673 (113)
#. Search: max 5                       198499    190404  (96)    204905 (103)
#. Search: radius 1000, max 5           46942     51295 (109)     54554 (116)
#. Search: radius 1000                  47908     51941 (108)     55522 (116)


#. These will be set to references to the Perl versions of the functions
my $SetUpDistance_perl      = undef;
my $HaversineDistance_perl  = undef;
my $ComputeAreaExtrema_perl = undef;
my $fast_log2_perl          = undef;

#. Choose whether to use Perl or C code for distance and log2 calculations
#. Default is to use the Perl functions

# Used by new(...)
sub SetDistanceFunctionType($) {
	my ($type) = @_;
	
	#. Get function pointers for the Perl versions (if not already recorded)
	$SetUpDistance_perl      = *SetUpDistance      unless (defined $SetUpDistance_perl);
	$HaversineDistance_perl  = *HaversineDistance  unless (defined $HaversineDistance_perl);
	$ComputeAreaExtrema_perl = *ComputeAreaExtrema unless (defined $ComputeAreaExtrema_perl);
	$fast_log2_perl          = *fast_log2          unless (defined $fast_log2_perl);
	
	#. Choose the type of functions to use:
	
	if ( $type eq 'perl' ) {
		#. Switch to using Perl code for distance and log2 calculations
		
		*Geo::Index::SetUpDistance      = $SetUpDistance_perl;
		*Geo::Index::HaversineDistance  = $HaversineDistance_perl;
		*Geo::Index::fast_log2          = $fast_log2_perl;
		*Geo::Index::ComputeAreaExtrema = $ComputeAreaExtrema_perl;
		$ACTIVE_CODE = 'perl';
		
		$C_CODE_ACTIVE = 0;
		
		return 1; # success
		
	} elsif ( $C_CODE_COMPILED && $type eq 'double' ) {
		#. Switch to using C double code for distance and log2 calculations
		
		*Geo::Index::SetUpDistance      = *Geo::Index::SetUpDistance_double;
		*Geo::Index::HaversineDistance  = *Geo::Index::HaversineDistance_double;
		*Geo::Index::fast_log2          = *Geo::Index::fast_log2_double;
		*Geo::Index::ComputeAreaExtrema = *Geo::Index::ComputeAreaExtrema_double;
		$ACTIVE_CODE = 'double';
		
		$C_CODE_ACTIVE = 1;
		
		return 1; # success
	

lib/Geo/Index.pm  view on Meta::CPAN

To help tune the C<levels> value, the C<L<GetConfiguration( )|/GetConfiguration( )>> method can 
be used to find out the physical size of the most detailed level along with 
statistics on the number of points per index tile.

=item * B<Use the C<quick_results> option when possible.>

Filtering points and combining them into a single, flat list can be very 
expensive.  Many applications can tolerate getting additional points beyond 
those matching the search criteria.  An example of this is drawing points on 
a map; if points are clipped to the visible area when they are 
drawn it may not matter if some of them lie outside of it.

=item * B<Use C<L<Search(...)|/Search( ... )>> instead of C<L<Closest(...)|/Closest( ... )>> when you have a search radius.>

The C<L<Closest(...)|/Closest( ... )>> function is most efficient when no search radius is 
specified or when result points lie very close to the search point.  Closeness 
is relative to the tile size of the most detailed index level; for the default 
index depth (C<20>), "very close" is roughly within about 100 meters.

When clipping results to a maximal radius it is typically much faster to use 
C<L<Search(...)|/Search( ... )>> with the C<sort_results> and C<max_results> options*.

For example, to find the closest C<$n> points within distance C<$d> of a point 
C<$p> it is usually much faster to use

=over

 %options = (
              max_results    => $n, 
              radius         => $d, 
              sort_results   => 1,
              post_condition => sub { return $_[0] != $_[1]; }
            );
 $results = $index->Search( $p, \%options );

=back

instead of 

=over

 $results = $index->Closest( $p, $n { radius => $d } );

=back

* The C<post_condition> shown in the example omits the search point from 
the results and is needed to fully emulate the behavior of C<Closest(...)>.

=back

=head2 Technical discussion

Both C<L<Search(...)|/Search( ... )>> and C<L<SearchByBounds(...)|/SearchByBounds( ... )>> are very fast since 
they can find the relevant index tiles in linear time.  Since the time needed to 
filter the results is directly proportional to the number of points retrieved 
from the index, best performance occurs when the size of the most detailed tiles 
is smaller than that of the typical search radius or search bounds.

Searches run using C<L<Closest(...)|/Closest( ... )>> are done starting from the most 
detailed level and work upwards.  Best performance occurs when a result is found 
in the first few iterations.  If the first iteration that finds points yields a 
large number of points then performance will suffer since the distance to each 
of these points will need to be measured to find the closest.  For similar 
reasons, requesting a large number of closest points in a single call will also 
impact performance.  The C<L<Farthest(...)|/Farthest( ... )>> method is largely a wrapper for 
C<L<Closest(...)|/Closest( ... )>> and thus exhibits similar behavior.

Some functions within Geo::Index have optional implementations written in C.  If 
these are active (by default they are whenever possible) searches typically run 
25% to 50% faster.

Whenever possible Geo::Index uses numeric index keys.  Compared to text index 
keys, numeric keys improve performance with about 30% faster speed and about 50% 
smaller index memory usage.  The downside to numeric keys is that they are less 
legible to humans while debugging.  (Whether numeric or text keys are used can 
be changed by setting the appropriate value at the top of C<Geo/Index.pm>)

=head2 Benchmark results

Typical benchmark results run on a modern workstation using numeric keys and 
double-precision C low-level code with the index containing 1,000,000 points 
are as follows:

=over

=item

B<C<L<IndexPoints(...)|/IndexPoints( ... )>>>

Points can be added to an index at the rate of about 50,000 per second.

=item

B<C<L<Search(...)|/Search( ... )>>>

Typical searches returning values run at about 25,000 to 50,000 searches per 
second.  Worst-case performance is under 50 searches per second and searches 
returning no results run at over 100,000 searches per second.  The overhead of 
traversing the results is fairly negligable.

Quick searches run at 120,000 to 150,000 searches per second.  Actually doing 
anything with the results slows things down a lot.  Including traversal of the 
results, a typical quick search runs at 40,000 to 100,000 searches per second 
with the worst-case being about 80 searches per second.

If distances to the result points are not needed, quick searches are typically 
about 75% faster than normal ones albeit with about 5 times as many results 
being returned.

=item

B<C<L<SearchByBounds(...)|/SearchByBounds( ... )>>>

For the C<L<SearchByBounds(...)|/SearchByBounds( ... )>> method run time correlates with the 
size of the bounding box with smaller bounding boxes typically yielding faster 
run times.

A fairly typical search yielding about 50 results runs at about 10,000 searches 
per second in normal mode and about 30,000 searches per second in quick mode.
A nearly worst case example is a search returning 100,000 points; this will run 
at about 5 searches per second in normal mode or about 8,000 searches per second 



( run in 0.476 second using v1.01-cache-2.11-cpan-71847e10f99 )