Geo-Index
view release on metacpan or search on metacpan
lib/Geo/Index.pm view on Meta::CPAN
The point to measure distances to
This can be either a hash containing at a minimum a C<lat> and a C<lon> value
(both in degrees) or an array giving the point. See the B<L<Points|/POINTS>>
section above for details.
=back
=back
=cut
# Used by Distance(...)
sub DistanceTo($$) {
my ($self, $p1) = @_;
if (ref $p1 eq 'ARRAY') {
#. Got array; expand arguments into a full point
$p1 = { 'lat'=>$$p1[0], 'lon'=>$$p1[1] };
}
$$p1{lat_rad} = Math::Trig::deg2rad($$p1{lat}) unless ($$p1{lat_rad});
$$p1{lon_rad} = Math::Trig::deg2rad($$p1{lon}) unless ($$p1{lon_rad});
return HaversineDistance($$p1{lat_rad}, $$p1{lon_rad});
}
#. Distance functions
#.
#. Geo::Index uses the haversine formula to compute great circle distances
#. between points.
#.
#. Three versions are supported: a fallback version written in Perl (used if the
#. C versions fail to compile) and two accelerated versions written in C, one
#. using floats and the other using doubles. By default the C float version is
#. used; if it fails to compile then the Perl version is used. Use of a specific
#. version can also be requested with Geo::Index->SetDistanceFunctionType(...).
#.
#. The Perl version uses doubles. When using floats instead of doubles the loss
#. of precision is typically under a meter (about 2 meters in the worst case).
#. #. Compared to the errors inherent to the haversine function, this loss of
#. precision is negligable.
#. Here are the results of benchmarking the three versions on a fairly high-end
#. workstation (higher numbers are better). The test dataset is 1 million random
#. points and each search type was performed once for each point in random order.
#. The same points were used for each test and they were in the same order. All
#. searches returned results as lists except for the 'all points' search which
#. returned a list reference. The default options (Earth, 20-level index) were
#. used for each test. Each version's benchmark was run 32 times; some jitter
#. was observed.
#.
#. Average number of operations per second using each version (rounded):
#. Percentages (in parentheses) are relative to the pure-Perl version.
#.
#. Results for 32 iterations of each test:
#.
#. Operation Perl C double (%) C float (%)
#. -------------------------------- ------ ------------ ------------
#. Add point to index 35861 36067 (101) 36060 (101)
#. Search: return all points 256127 256877 (101) 252116 (98)
#. Search: sort, max 5 6733 8718 (129) 8860 (132)
#. Search: sort, radius 1000, max 5 45364 49063 (108) 49831 (110)
#. Search: sort, radius 1000 45902 49418 (108) 51673 (113)
#. Search: max 5 198499 190404 (96) 204905 (103)
#. Search: radius 1000, max 5 46942 51295 (109) 54554 (116)
#. Search: radius 1000 47908 51941 (108) 55522 (116)
#. These will be set to references to the Perl versions of the functions
my $SetUpDistance_perl = undef;
my $HaversineDistance_perl = undef;
my $ComputeAreaExtrema_perl = undef;
my $fast_log2_perl = undef;
#. Choose whether to use Perl or C code for distance and log2 calculations
#. Default is to use the Perl functions
# Used by new(...)
sub SetDistanceFunctionType($) {
my ($type) = @_;
#. Get function pointers for the Perl versions (if not already recorded)
$SetUpDistance_perl = *SetUpDistance unless (defined $SetUpDistance_perl);
$HaversineDistance_perl = *HaversineDistance unless (defined $HaversineDistance_perl);
$ComputeAreaExtrema_perl = *ComputeAreaExtrema unless (defined $ComputeAreaExtrema_perl);
$fast_log2_perl = *fast_log2 unless (defined $fast_log2_perl);
#. Choose the type of functions to use:
if ( $type eq 'perl' ) {
#. Switch to using Perl code for distance and log2 calculations
*Geo::Index::SetUpDistance = $SetUpDistance_perl;
*Geo::Index::HaversineDistance = $HaversineDistance_perl;
*Geo::Index::fast_log2 = $fast_log2_perl;
*Geo::Index::ComputeAreaExtrema = $ComputeAreaExtrema_perl;
$ACTIVE_CODE = 'perl';
$C_CODE_ACTIVE = 0;
return 1; # success
} elsif ( $C_CODE_COMPILED && $type eq 'double' ) {
#. Switch to using C double code for distance and log2 calculations
*Geo::Index::SetUpDistance = *Geo::Index::SetUpDistance_double;
*Geo::Index::HaversineDistance = *Geo::Index::HaversineDistance_double;
*Geo::Index::fast_log2 = *Geo::Index::fast_log2_double;
*Geo::Index::ComputeAreaExtrema = *Geo::Index::ComputeAreaExtrema_double;
$ACTIVE_CODE = 'double';
$C_CODE_ACTIVE = 1;
return 1; # success
lib/Geo/Index.pm view on Meta::CPAN
To help tune the C<levels> value, the C<L<GetConfiguration( )|/GetConfiguration( )>> method can
be used to find out the physical size of the most detailed level along with
statistics on the number of points per index tile.
=item * B<Use the C<quick_results> option when possible.>
Filtering points and combining them into a single, flat list can be very
expensive. Many applications can tolerate getting additional points beyond
those matching the search criteria. An example of this is drawing points on
a map; if points are clipped to the visible area when they are
drawn it may not matter if some of them lie outside of it.
=item * B<Use C<L<Search(...)|/Search( ... )>> instead of C<L<Closest(...)|/Closest( ... )>> when you have a search radius.>
The C<L<Closest(...)|/Closest( ... )>> function is most efficient when no search radius is
specified or when result points lie very close to the search point. Closeness
is relative to the tile size of the most detailed index level; for the default
index depth (C<20>), "very close" is roughly within about 100 meters.
When clipping results to a maximal radius it is typically much faster to use
C<L<Search(...)|/Search( ... )>> with the C<sort_results> and C<max_results> options*.
For example, to find the closest C<$n> points within distance C<$d> of a point
C<$p> it is usually much faster to use
=over
%options = (
max_results => $n,
radius => $d,
sort_results => 1,
post_condition => sub { return $_[0] != $_[1]; }
);
$results = $index->Search( $p, \%options );
=back
instead of
=over
$results = $index->Closest( $p, $n { radius => $d } );
=back
* The C<post_condition> shown in the example omits the search point from
the results and is needed to fully emulate the behavior of C<Closest(...)>.
=back
=head2 Technical discussion
Both C<L<Search(...)|/Search( ... )>> and C<L<SearchByBounds(...)|/SearchByBounds( ... )>> are very fast since
they can find the relevant index tiles in linear time. Since the time needed to
filter the results is directly proportional to the number of points retrieved
from the index, best performance occurs when the size of the most detailed tiles
is smaller than that of the typical search radius or search bounds.
Searches run using C<L<Closest(...)|/Closest( ... )>> are done starting from the most
detailed level and work upwards. Best performance occurs when a result is found
in the first few iterations. If the first iteration that finds points yields a
large number of points then performance will suffer since the distance to each
of these points will need to be measured to find the closest. For similar
reasons, requesting a large number of closest points in a single call will also
impact performance. The C<L<Farthest(...)|/Farthest( ... )>> method is largely a wrapper for
C<L<Closest(...)|/Closest( ... )>> and thus exhibits similar behavior.
Some functions within Geo::Index have optional implementations written in C. If
these are active (by default they are whenever possible) searches typically run
25% to 50% faster.
Whenever possible Geo::Index uses numeric index keys. Compared to text index
keys, numeric keys improve performance with about 30% faster speed and about 50%
smaller index memory usage. The downside to numeric keys is that they are less
legible to humans while debugging. (Whether numeric or text keys are used can
be changed by setting the appropriate value at the top of C<Geo/Index.pm>)
=head2 Benchmark results
Typical benchmark results run on a modern workstation using numeric keys and
double-precision C low-level code with the index containing 1,000,000 points
are as follows:
=over
=item
B<C<L<IndexPoints(...)|/IndexPoints( ... )>>>
Points can be added to an index at the rate of about 50,000 per second.
=item
B<C<L<Search(...)|/Search( ... )>>>
Typical searches returning values run at about 25,000 to 50,000 searches per
second. Worst-case performance is under 50 searches per second and searches
returning no results run at over 100,000 searches per second. The overhead of
traversing the results is fairly negligable.
Quick searches run at 120,000 to 150,000 searches per second. Actually doing
anything with the results slows things down a lot. Including traversal of the
results, a typical quick search runs at 40,000 to 100,000 searches per second
with the worst-case being about 80 searches per second.
If distances to the result points are not needed, quick searches are typically
about 75% faster than normal ones albeit with about 5 times as many results
being returned.
=item
B<C<L<SearchByBounds(...)|/SearchByBounds( ... )>>>
For the C<L<SearchByBounds(...)|/SearchByBounds( ... )>> method run time correlates with the
size of the bounding box with smaller bounding boxes typically yielding faster
run times.
A fairly typical search yielding about 50 results runs at about 10,000 searches
per second in normal mode and about 30,000 searches per second in quick mode.
A nearly worst case example is a search returning 100,000 points; this will run
at about 5 searches per second in normal mode or about 8,000 searches per second
( run in 0.476 second using v1.01-cache-2.11-cpan-71847e10f99 )