PDL-NDBin

scalars, and requires the user to input the data points one by one. Similarly,
to produce the final histogram, the bins must be queried one by one.

Math::Histogram and Math::SimpleHisto::XS accept Perl arrays filled with values
(although they also accept data points one by one as Perl scalars). Passing
large amounts of data in an array is generally more efficient than passing the
data points one by one as scalars.

PDL and PDL::NDBin operate on ndarrays only, which are memory-efficient, packed
data arrays. This could be considered both an advantage and a disadvantage. The
advantage is that the ndarrays can be operated on very efficiently in C. The
disadvantage is that PDL is required!

=item Performance

In the next section (see L</PERFORMANCE>), the performance of all modules is
examined in detail.

=item Support for weighted histograms

In a weighted histogram, data points contribute by a fractional amount (or
weight) between 0 and 1. All libraries, except PDL::NDBin, support weighted
histograms. In PDL::NDBin, the weight of all data points is fixed at 1.

=item Uses PDL broadcasting

In PDL, broadcasting is a technique to automatically loop certain operations over
an arbitrary number of dimensions. An example is the sumover() operation, which
calculates the row sum. It is defined over the first dimension only (i.e., the
rows in PDL), but it will be looped automatically over all remaining
dimensions. If the ndarray is three-dimensional, for instance, sumover() will
calculate the sum in every row of every matrix.

Broadcasting is supported by the PDL functions histogram(), whistogram(), and
their two-dimensional counterparts, but not by hist() or whist(). PDL::NDBin
does not (yet) support broadcasting.

=item Variable-width bins

In a histogram with variable-width bins, the width of the bins needn't be
equal. This feature can be useful, for example, to construct bins on a
logarithmic scale. Math::GSL, Math::Histogram, and Math::SimpleHisto::XS
support variable-width bins; PDL does not, and is limited to fixed-width bins.

Since version 0.017, PDL::NDBin supports variable-width bins if an ndarray or
Perl array containing the bin boundaries is passed in via the I<grid> parameter
to axis specifications.

=back
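
As an illustration of the broadcasting behaviour described above for sumover(),
here is a minimal sketch. It is not taken from the PDL::NDBin distribution; the
variable names and the 4x3x2 shape are chosen for illustration only. It sums
over the first dimension of a three-dimensional ndarray:

```perl
use strict;
use warnings;
use PDL;

# Two matrices, each with 3 rows of 4 elements. In PDL the first
# dimension runs along a row, so the dims are (4, 3, 2).
my $cube = sequence(4, 3, 2);    # values 0 .. 23

# sumover() is defined over the first dimension only; PDL loops it
# automatically over the remaining two dimensions.
my $sums = $cube->sumover;

print "dims: ", join(', ', $sums->dims), "\n";    # one sum per row per matrix
print $sums, "\n";
```

The result has dims (3, 2): the first dimension has been summed away, and one
row sum remains for each of the 3 rows in each of the 2 matrices.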

=head1 PERFORMANCE

=head2 One-dimensional histograms

This section aims to give an idea of the performance of PDL::NDBin. Some of the
most important features of PDL::NDBin aren't found in other modules on CPAN.
Still, CPAN offers a few other histogramming modules, and it is interesting to
examine how well PDL::NDBin does in comparison.

I've run a number of tests with PDL::NDBin version 0.008 on a laptop with an
Intel i3 CPU running at 2.40 GHz, and on a desktop with an Intel i7 CPU running
at 2.80 GHz and fast disks. The following table, obtained with 100 bins and a
data file of 2 million data points, shows typical results on the laptop:

	Benchmark: timing 50 iterations of MGH, MH, MSHXS, PND, hist, histogram...
	       MGH: 42 wallclock secs (42.48 usr +  0.05 sys = 42.53 CPU) @  1.18/s (n=50)
	        MH:  6 wallclock secs ( 5.53 usr +  0.00 sys =  5.53 CPU) @  9.04/s (n=50)
	     MSHXS:  2 wallclock secs ( 2.21 usr +  0.01 sys =  2.22 CPU) @ 22.52/s (n=50)
	       PND:  2 wallclock secs ( 1.40 usr +  0.00 sys =  1.40 CPU) @ 35.71/s (n=50)
	      hist:  1 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 45.87/s (n=50)
	 histogram:  1 wallclock secs ( 1.08 usr +  0.00 sys =  1.08 CPU) @ 46.30/s (n=50)

	Relative performance:
	            Rate       MGH        MH     MSHXS       PND      hist histogram
	MGH       1.18/s        --      -87%      -95%      -97%      -97%      -97%
	MH        9.04/s      669%        --      -60%      -75%      -80%      -80%
	MSHXS     22.5/s     1816%      149%        --      -37%      -51%      -51%
	PND       35.7/s     2938%      295%       59%        --      -22%      -23%
	hist      45.9/s     3802%      407%      104%       28%        --       -1%
	histogram 46.3/s     3838%      412%      106%       30%        1%        --
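
The percentages in the relative-performance table can be reproduced from the
CPU times in the first table. The sketch below assumes that each entry is the
ratio of the two rates (iterations per CPU second), minus one, expressed as a
percentage. The helper name relative() is made up for this illustration:

```perl
use strict;
use warnings;

# CPU seconds for 50 iterations, copied from the benchmark output above.
my %cpu = (
    MGH       => 42.53,
    MH        => 5.53,
    MSHXS     => 2.22,
    PND       => 1.40,
    hist      => 1.09,
    histogram => 1.08,
);

# Hypothetical helper: how much faster (positive) or slower (negative)
# $row is than $col, in percent. The iteration count cancels out, so the
# ratio of the rates equals the inverse ratio of the CPU times.
sub relative {
    my ($row, $col) = @_;
    return 100 * ($cpu{$col} / $cpu{$row} - 1);
}

for my $other (qw(MGH MH MSHXS hist histogram)) {
    printf "PND vs %-9s %5.0f%%\n", $other, relative('PND', $other);
}
```

Working from the CPU times rather than from the rounded rates matters for the
large ratios: 42.53 / 1.40 gives the 2938% shown in the PND row, whereas the
rounded rates 35.71 and 1.18 would give a slightly different figure.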

From this test and other tests, it can be concluded that PDL::NDBin (shown as
'PND' in the table) is, roughly speaking,

=over 4

=item 1. faster than Math::GSL::Histogram (shown as MGH in the table)

Although this module is actually a wrapper around the C library GSL, the
performance is rather low. The process of getting a large number of data points
into Math::GSL::Histogram's data structures is inefficient, as the data points
have to be input one by one.

=item 2. faster than Math::Histogram (shown as MH)

This library wraps another multidimensional histogramming library written in C.
It allows inputting multiple data points at once. It is quite a bit faster than
Math::GSL::Histogram, but does not offer the raw performance of PDL or
Math::Histogram's cousin Math::SimpleHisto::XS.

=item 3. faster than Math::SimpleHisto::XS (shown as MSHXS)

Math::SimpleHisto::XS, by the same author as Math::Histogram, is similar to the
latter library, but is implemented in XS for speed and limited to
one-dimensional histograms. In the table above, it reaches about 22.5
iterations per second, against 35.7 for PDL::NDBin.

=item 4. slower than PDL

PDL's built-in functions hist() and histogram() are, on average, the fastest
functions. Given that the core of these routines runs entirely in C, this is
not very surprising. The PDL functions have very low overhead and are very
memory-efficient.

=back

Note that, in the tests, various data conversions between ndarrays and ordinary
Perl arrays were required. The timings exclude these conversions, and count
only the time required to produce a histogram from the "natural" data
structure, i.e., ndarrays for PDL-based modules, and ordinary Perl arrays for
the other modules.
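
The following is a minimal sketch of a timing setup along these lines. It is
not the script that produced the table above. It assumes the core Benchmark
module together with PDL's hist() and histogram(), and it uses a plain Perl
loop (labelled C<perl_loop>, a hypothetical stand-in) in place of the
array-based modules. The data size and iteration count are arbitrary. The
conversion from ndarray to Perl array is done once, outside the timed code:

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
use PDL;

my $n     = 200_000;    # number of data points (arbitrary)
my $nbins = 100;        # bins of width 1 covering [0, 100)

# "Natural" data structures, prepared before any timing starts.
my $data = random($n) * 100;    # ndarray for the PDL functions
my @list = $data->list;         # Perl array for the pure-Perl loop

my $results = timethese(20, {
    # histogram( $data, $step, $min, $numbins )
    histogram => sub { my $h = histogram($data, 1, 0, $nbins) },
    # hist( $data, $min, $max, $step ), called in scalar context
    hist      => sub { my $h = hist($data, 0, 100, 1) },
    # Stand-in for a module that consumes an ordinary Perl array.
    perl_loop => sub {
        my @h = (0) x $nbins;
        $h[ int $_ ]++ for @list;
    },
}, 'none');

# Print a relative-performance table like the one shown earlier.
cmpthese($results);
```

Because the conversion happens before timethese() is called, each entry
measures only the time needed to build the histogram from its own input type.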


