Data-NDArray-Shared

lib/Data/NDArray/Shared.pm  view on Meta::CPAN

package Data::NDArray::Shared;
use strict;
use warnings;
use Carp ();
our $VERSION = '0.01';
require XSLoader;
XSLoader::load('Data::NDArray::Shared', $VERSION);

*numel = \&size;
*flat  = \&to_list;

# ---------------------------------------------------------------------------
# PDL interop.  PDL is an optional, load-on-demand dependency (no build or
# runtime prereq).  Each dtype maps to a PDL type of the SAME byte width, so the
# data copies/aliases with no element conversion.  NOTE the axis order: this
# array is row-major C-order while PDL's dim(0) is the fastest-varying axis, so
# shapes are reversed across the boundary -- an (r, c) array <-> PDL dims (c, r).
# ---------------------------------------------------------------------------
my %PDL_TYPE = (   # dtype => PDL type-constructor name
    f64 => 'double',    f32 => 'float',
    i64 => 'longlong',  i32 => 'long',   i16 => 'short',  i8 => 'sbyte',
    u64 => 'ulonglong', u32 => 'ulong',  u16 => 'ushort', u8 => 'byte',
);
my %DTYPE_OF = reverse %PDL_TYPE;   # PDL type name => dtype

sub _require_pdl {
    eval { require PDL; 1 }
        or Carp::croak("Data::NDArray::Shared: PDL interop needs PDL installed (cpanm PDL)");
}
sub _pdl_ctor {
    my ($dtype) = @_;
    my $name = $PDL_TYPE{$dtype} or Carp::croak("no PDL type for dtype '$dtype'");
    exists &{"PDL::$name"}
        or Carp::croak("this PDL has no '$name' type (needed for dtype '$dtype'); upgrade PDL");
    \&{"PDL::$name"};
}

# NDArray -> a NEW (copied) PDL piddle; dims = reverse(shape).
sub to_pdl {
    my ($self) = @_;
    _require_pdl();
    my $p = PDL->new_from_specification(_pdl_ctor($self->dtype)->(), reverse $self->shape);
    ${ $p->get_dataref } = $self->buffer;   # read-locked snapshot
    $p->upd_data;
    return $p;
}

# A NEW shared NDArray copied from a piddle; $path undef => anonymous mapping.
sub from_pdl {
    my ($class, $p, $path) = @_;
    _require_pdl();
    my $tname = "" . $p->type;
    my $dt = $DTYPE_OF{$tname}
        or Carp::croak("Data::NDArray::Shared->from_pdl: unsupported PDL type '$tname'");
    $p = $p->copy;                          # force a contiguous, physical piddle
    my $self = $class->new($path, $dt, reverse $p->dims);
    $self->update_from_bytes(${ $p->get_dataref });
    return $self;
}

# Copy a piddle into THIS array in place (same dtype + shape); returns self.
sub update_from_pdl {
    my ($self, $p) = @_;
    _require_pdl();
    my $tname = "" . $p->type;
    my $dt = $DTYPE_OF{$tname}
        or Carp::croak("Data::NDArray::Shared->update_from_pdl: unsupported PDL type '$tname'");
    $dt eq $self->dtype
        or Carp::croak("update_from_pdl: dtype mismatch (piddle $dt vs array " . $self->dtype . ")");
    my @want = reverse $self->shape;
    my @got  = $p->dims;
    "@want" eq "@got"
        or Carp::croak("update_from_pdl: shape mismatch (array (@{[ $self->shape ]}) vs piddle dims (@got))");
    $p = $p->copy;
    $self->update_from_bytes(${ $p->get_dataref });
    return $self;
}

# Zero-copy: a PDL ndarray ALIASING this array's shared mmap, built via PDL's C
# API (PDL_DONTTOUCHDATA, so PDL never frees/reallocates our mapping).  In-place
# PDL ops write straight through (visible to every sharing process); reads see
# live data.  NO locking -- coordinate access yourself.  The array is kept alive
# while the piddle lives.  Needs PDL at BUILD time (the C path); croaks otherwise.
sub as_pdl_alias {
    my ($self) = @_;
    _require_pdl();
    my $typenum = _pdl_ctor($self->dtype)->()->enum;        # PDL type number for our dtype
    # _alias_pdl_create croaks if the module was built without PDL (no C path).
    my $p = $self->_alias_pdl_create($typenum, [ reverse $self->shape ]);   # dims in PDL order
    $p->hdr->{_nda_shared} = $self;   # keep the mapping alive while the piddle lives
    return $p;
}

1;
__END__

=encoding utf-8

=head1 NAME

Data::NDArray::Shared - shared-memory typed N-dimensional numeric array for Linux

=head1 SYNOPSIS

anonymous, memfd, or fd-reopened arrays); C<memfd> returns the backing descriptor
-- the memfd of a C<new_memfd> array or the dup'd fd of a C<new_from_fd> array,
and -1 for file-backed or anonymous arrays.

=head1 STATS

C<stats()> returns a hashref describing the array:

=over 4

=item * C<dtype> -- the dtype name string.

=item * C<ndim> -- the number of dimensions.

=item * C<size> -- the total element count.

=item * C<itemsize> -- bytes per element.

=item * C<shape> -- an arrayref of the dimension sizes.

=item * C<ops> -- running count of operations that took the write lock (every
C<set>, C<set_flat>, C<fill>, C<zero>, C<reshape>, C<add_scalar>,
C<mul_scalar>, C<add>, C<subtract>, C<multiply>).

=item * C<mmap_size> -- bytes of the shared mapping.

=back
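
The fields are related: C<ndim> is the number of entries in C<shape>, and
C<size> is the product of those entries. A minimal sketch of a sanity check over
such a hashref follows. C<check_stats> is a hypothetical helper, not part of
this module, and the hashref is hand-built with illustrative values rather than
taken from a live array.

    use strict;
    use warnings;
    use List::Util qw(product);

    # check_stats is a hypothetical helper, not part of Data::NDArray::Shared.
    # Returns the payload size in bytes (size * itemsize); dies on inconsistency.
    sub check_stats {
        my ($st) = @_;
        my @shape = @{ $st->{shape} };
        $st->{ndim} == @shape
            or die "ndim $st->{ndim} != number of shape entries " . @shape;
        $st->{size} == product(1, @shape)
            or die "size $st->{size} != product of shape (@shape)";
        return $st->{size} * $st->{itemsize};
    }

    # Illustrative values for a 3x4 f64 array.
    my %example = (dtype => 'f64', ndim => 2, size => 12, itemsize => 8, shape => [3, 4]);
    print check_stats(\%example), " payload bytes\n";

With a live array, the same check would be called as
C<< check_stats($array->stats) >>.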

=head1 PDL INTEROP

If L<PDL> is installed the array converts to and from PDL ndarrays. PDL is an
B<optional, load-on-demand> dependency -- there is no build- or runtime prereq;
the four conversion methods (C<to_pdl>, C<from_pdl>, C<update_from_pdl>,
C<as_pdl_alias>) C<croak> if PDL is missing, while C<buffer> and
C<update_from_bytes> have no PDL dependency. Each dtype maps to a PDL type of
the B<same byte width> (C<f64> to C<double>, C<i32> to C<long>, C<u64> to
C<ulonglong>, and so on), so the data moves with no per-element conversion.

B<Axis order:> this array is row-major (C-order) while PDL's C<dim(0)> is the
B<fastest-varying> axis, so the shape is B<reversed> across the boundary -- an
C<($r, $c)> array corresponds to PDL dims C<($c, $r)>, and
C<< $piddle-E<gt>at($j, $i) >> is C<< $array-E<gt>get($i, $j) >>. The conversion
methods handle this for you.
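
The reversal can be checked with plain index arithmetic. The sketch below uses
no PDL and no shared array; C<offset_row_major> and C<offset_dim0_fastest> are
hypothetical helpers that only compute flat offsets under the two layouts.

    use strict;
    use warnings;

    # Flat offset of element (i, j) in a row-major (r, c) array: last index fastest.
    sub offset_row_major { my ($i, $j, $r, $c) = @_; return $i * $c + $j }

    # Flat offset of at(x, y) in a layout with dims (d0, d1): first index fastest.
    sub offset_dim0_fastest { my ($x, $y, $d0, $d1) = @_; return $x + $y * $d0 }

    my ($r, $c) = (2, 3);              # array shape (r, c); PDL dims are (c, r)
    for my $i (0 .. $r - 1) {
        for my $j (0 .. $c - 1) {
            my $here  = offset_row_major($i, $j, $r, $c);
            my $there = offset_dim0_fastest($j, $i, $c, $r);   # indices swapped
            die "layouts disagree at ($i, $j)" unless $here == $there;
        }
    }
    print "get(i, j) and at(j, i) address the same element\n";

Because both offsets agree for every index pair, the same bytes serve both
layouts, and only the order of the dims and indices changes.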

=over 4

=item * C<< $piddle = $array->to_pdl >>

A B<new> piddle holding a B<copy> of the data, of the mapped PDL type and dims
C<< reverse($array-E<gt>shape) >>. Read under the lock, so it is a consistent
snapshot.

=item * C<< $array = Data::NDArray::Shared->from_pdl($piddle, $path) >>

A B<new> shared array B<copied> from C<$piddle> (made physical and contiguous
first); the dtype and shape follow the piddle's type and C<reverse> of its dims.
C<$path> is the backing file (C<undef> or omitted for an anonymous mapping).

=item * C<< $array->update_from_pdl($piddle) >>

Copy C<$piddle> into this array B<in place> (write-locked). The piddle's type
must match the dtype and its dims must equal C<< reverse($array-E<gt>shape) >>,
else it croaks. Returns the array.

=item * C<< $piddle = $array->as_pdl_alias >>

A piddle that B<aliases the shared mapping with no copy> (a real
C<PDL_DONTTOUCHDATA> ndarray over our memory): an B<in-place> PDL operation
(C<< $p .= ... >>, C<< $p-E<gt>inplace-E<gt>... >>) writes straight through to
shared memory -- visible to every process that maps it -- and reads see live
data. The array is kept alive for as long as the piddle.

This one method needs PDL at B<build> time (it is compiled against PDL's C API):
if the module was installed without PDL present it C<croak>s, while the copy
methods above keep working through a runtime C<require PDL>. Reinstall with PDL
installed to enable it.

B<Caveats.> The alias B<bypasses the rwlock>: you must coordinate access
yourself (no other process mutating concurrently), as with any unlocked
shared-memory view. Do not B<resize or retype> the alias (a reshape that grows
it, a type conversion) -- it is a fixed window onto the mapping; use
C<to_pdl>/C<from_pdl> when you want an independent, resizable copy.

=item * C<< $bytes = $array->buffer >>

The raw contiguous data region as a byte string (read-locked snapshot),
row-major C-order -- useful on its own for serialization or IPC, and the basis
for C<to_pdl>. C<< $array->update_from_bytes($bytes) >> is the inverse
(write-locked; the string must be exactly C<< size * itemsize >> bytes).

=back
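
For an C<f64> array the byte string from C<buffer> can be handled with core
C<pack>/C<unpack> alone. The sketch below assumes the bytes are native-endian
doubles (pack template C<d>), which is what an in-memory mapping on the same
host would hold. C<rows_from_bytes> and C<bytes_from_rows> are hypothetical
helpers, and the byte string is built locally as a stand-in for
C<< $array->buffer >>.

    use strict;
    use warnings;

    # Hypothetical helpers for a 2-D f64 array; "d" is pack's native double.
    sub rows_from_bytes {
        my ($bytes, $r, $c) = @_;
        length($bytes) == $r * $c * 8
            or die "expected " . $r * $c * 8 . " bytes, got " . length($bytes);
        my @flat = unpack "d*", $bytes;
        return map { [ @flat[ $_ * $c .. $_ * $c + $c - 1 ] ] } 0 .. $r - 1;
    }
    sub bytes_from_rows { return pack "d*", map { @$_ } @_ }

    # Stand-in for $array->buffer on a 2x3 f64 array.
    my $bytes = bytes_from_rows([0, 1.5, 3], [4.5, 6, 7.5]);
    my @rows  = rows_from_bytes($bytes, 2, 3);
    print "row 1: @{ $rows[1] }\n";    # second row of the row-major data

With a live 2-D C<f64> array, the decode step would be
C<< rows_from_bytes($array->buffer, $array->shape) >> and the write-back
C<< $array->update_from_bytes(bytes_from_rows(@rows)) >>.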

See F<eg/pdl_interop.pl> for a worked example, including a cross-process PDL
transform on one shared array.
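
A minimal copy-based round trip, sketched under these assumptions: an anonymous
2x3 C<f64> array, with both this module and PDL installed. C<scale_via_pdl> is a
hypothetical wrapper, and its string-C<eval> guard exists only so the sketch
degrades gracefully when either module is missing.

    use strict;
    use warnings;

    # Returns the updated element (1, 2), or undef if either module is unavailable.
    sub scale_via_pdl {
        eval "use PDL; use Data::NDArray::Shared; 1" or return;
        my $array = Data::NDArray::Shared->new(undef, "f64", 2, 3);   # anonymous mapping
        $array->update_from_bytes(pack "d*", 0 .. 5);   # native doubles, row-major

        my $piddle = $array->to_pdl;          # copied snapshot, dims (3, 2)
        my $scaled = $piddle * 2 + 1;         # new double piddle, same dims
        $array->update_from_pdl($scaled);     # write-locked copy back in place

        return $array->get(1, 2);             # same element as $scaled->at(2, 1)
    }

    my $value = scale_via_pdl();
    print defined $value ? "element (1, 2) = $value\n" : "PDL interop unavailable\n";

The arithmetic runs on the copy, so other processes see the new values only
once C<update_from_pdl> has completed under the write lock.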

=head1 SHARING ACROSS PROCESSES

The array lives in a shared mapping, shared the same three ways as the rest of
the family: a B<backing file> (every process calls C<< new($path, $dtype,
@shape) >> on the same path), an B<anonymous mapping inherited across C<fork>>,
or a B<memfd> whose descriptor is passed to an unrelated process (over a UNIX
socket via C<SCM_RIGHTS>, or via C</proc/$pid/fd/$n>) and reopened with
C<< new_from_fd($fd) >>. Because the mapping is shared, B<every process reads and
writes the same elements>. All mutation is serialized by the write lock, so a
set of disjoint writers produces a well-defined final array regardless of how
they interleave.

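    # parent and children fill disjoint slices of one shared array
    my $arr = Data::NDArray::Shared->new(undef, "f64", 4000);   # before fork
    for my $k (0 .. 3) {                      # one child per 1000-element slice
        defined(my $pid = fork) or die "fork: $!";
        next if $pid;                         # parent: spawn the next child
        $arr->set_flat($_, $_) for $k * 1000 .. $k * 1000 + 999;
        exit;
    }
    1 while wait != -1;                       # reap every child
    print $arr->get_flat(500), "\n";   # reflects the first child's writes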

=head1 SECURITY

The mmap region is writable by all processes that open it. Do not share backing
files with untrusted processes.

=head1 CRASH SAFETY

Mutation is guarded by a futex-based write-preferring rwlock with PID-encoded
ownership; if a holder dies, the next contender detects the dead owner and
recovers. Because each mutation updates the data buffer (and, for C<reshape>, a
few header words) while holding the lock, a crash leaves the array consistent up
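to the last completed operation. B<Limitation>: PID reuse is not detected. If a
dead holder's PID has already been reassigned to a live process, the holder
still looks alive and recovery does not trigger (very unlikely in practice).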
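
The limitation follows from how a PID-keyed liveness check behaves in general.
The sketch below is an assumption for illustration, not this module's C
implementation: C<pid_looks_alive> is a hypothetical helper that probes a PID
with signal 0, which tests for existence without delivering anything.

    use strict;
    use warnings;
    $| = 1;                            # unbuffered, so the child repeats no output

    # Hypothetical liveness probe, not this module's C code.
    sub pid_looks_alive {
        my ($pid) = @_;
        return 1 if kill 0, $pid;     # process exists and we may signal it
        return $!{EPERM} ? 1 : 0;     # exists but owned by another user, else gone
    }

    print "self: ", pid_looks_alive($$), "\n";   # the current process is alive

    defined(my $child = fork) or die "fork: $!";
    exit 0 unless $child;             # child exits immediately
    waitpid $child, 0;                # reap it so the PID is released
    print "reaped child: ", pid_looks_alive($child), "\n";

The second probe normally reports the child as gone. It would report it as
alive if the kernel had already handed that PID to a new process, which is the
case a PID-keyed recovery scheme cannot tell apart.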

=head1 SEE ALSO

L<Data::Histogram::Shared>, L<Data::RoaringBitmap::Shared>,
L<Data::DisjointSet::Shared>, L<Data::CountMinSketch::Shared>,
L<Data::HyperLogLog::Shared>, L<Data::BloomFilter::Shared>,
L<Data::Intern::Shared>, L<Data::SortedSet::Shared>,
L<Data::SpatialHash::Shared>, and the rest of the C<Data::*::Shared> family.

=head1 AUTHOR

vividsnow

=head1 LICENSE

This is free software; you can redistribute it and/or modify it under the same
terms as Perl itself.

=cut


