Catmandu-Stat


lib/Catmandu/Exporter/Stat.pm

        $stats->{mean}     = $self->get_stat($key)->mean();
        $stats->{variance} = sprintf "%.1f" , $self->get_stat($key)->variance();
        $stats->{stdev}    = sprintf "%.1f" , $self->get_stat($key)->standard_deviation();
        my ($zeros,$zerosp,$occur_count,$values_count,$uniqs);
        $zeros  = $self->{res}->{$key}->{zero} // 0;
        $values_count  = $self->{res}->{$key}->{count};
        $occur_count   = $self->get_stat($key)->count();
        $zerosp = sprintf "%.1f" , $occur_count > 0 ? 100 * $zeros / $occur_count : 100;
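        # Estimated percentage of unique values; computed once and reused for the overflow check
        my $uniq_pct = $values_count > 0 ? 100 * $self->get_key_uniq($key) / $values_count : 0.0;
        $uniqs  = sprintf "%.1f" , $uniq_pct;

        # An estimate above 100% means the unique-value estimate exceeds the number of values seen
        my $overflow = $uniq_pct > 100 ? 1 : 0;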

        $stats->{zeros}    = $zeros;
        $stats->{'zeros%'} = $zerosp;
        $stats->{'uniq~'}  = floor($self->get_key_uniq($key));
        $stats->{'uniq%'}  = $uniqs;
        $stats->{'uniq%'} .= " (!)" if $overflow;
        $stats->{'uniq~'} .= " (!)" if $overflow;
        $stats->{entropy}  = $self->entropy($key);
        $stats->{entropy} .= " (!)" if $overflow;

        $exporter->add($stats);

        $has_overflow = 1 if $overflow;
    }

    $exporter->commit;

    if ($has_overflow) {
        print STDERR <<EOF;
Overflow warning - probably your dataset is too small for an accurate uniq~, uniq% and entropy count...
EOF
    }
}

1;

=head1 NAME

Catmandu::Exporter::Stat - a statistical export

=head1 SYNOPSIS

    # Calculate statistics on the availability of the ISBN fields in the dataset
    cat data.json | catmandu convert -v JSON to Stat --fields isbn

    # Export the statistics as YAML
    cat data.json | catmandu convert -v JSON to Stat --fields isbn --as YAML

=head1 DESCRIPTION

The L<Catmandu::Stat> package can be used to calculate statistics on the availability of
fields in a data file. Use this exporter to count how often fields are present or to count
the number of duplicate values. For each field the exporter calculates the following
statistics:

  * name     : the name of a field
  * count    : the number of occurrences of a field in all records
  * zeros    : the number of records without a field
  * zeros%   : the percentage of records without a field
  * min      : the minimum number of occurrences of a field in any record
  * max      : the maximum number of occurrences of a field in any record
  * mean     : the mean number of occurrences of a field in all records
  * variance : the variance of the number of occurrences of a field per record
  * stdev    : the standard deviation of the number of occurrences of a field per record
  * uniq~    : the estimated number of unique values
  * uniq%    : the estimated percentage of unique values
  * entropy  : the minimum and maximum entropy in the field values (estimated value)

Details:

  * entropy is an indication of the variation in field values (whether some values occur more often than others)
  * entropy values are displayed as : minimum/maximum entropy
  * when the minimum entropy = 0, then all the field values are equal
  * when the minimum and maximum entropy are equal, then all the field values are different
  * the 'uniq%' and 'entropy' fields are estimated and are normally within 1% of the
    correct value (this is done to keep the memory requirements of this module low)

Each statistical report contains one row named '#' (hash), which contains the total
number of records.
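
As an illustration of how the count-based columns relate to each other, here is a minimal
sketch. It is not the exporter's implementation: C<field_stats> is a hypothetical helper,
it takes the per-record occurrence counts and the field values as plain arrays, and it
counts unique values exactly with a hash, whereas the exporter only estimates C<uniq~>
and C<uniq%>.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Hypothetical helper, not part of Catmandu::Exporter::Stat.
# $counts : array ref, number of occurrences of one field in each record
# $values : array ref, every value seen for that field
sub field_stats {
    my ($counts, $values) = @_;

    my $records = scalar @$counts;
    my $nvalues = scalar @$values;
    my $count   = $records ? sum(@$counts) : 0;
    my $zeros   = grep { $_ == 0 } @$counts;    # records without the field

    # Exact unique count (assumption: the exporter uses an estimate instead)
    my %seen;
    $seen{$_}++ for @$values;
    my $uniq = scalar keys %seen;

    return {
        count    => $count,
        zeros    => $zeros,
        'zeros%' => sprintf("%.1f", $records > 0 ? 100 * $zeros / $records : 100),
        min      => $records ? min(@$counts) : 0,
        max      => $records ? max(@$counts) : 0,
        mean     => $records ? $count / $records : 0,
        'uniq%'  => sprintf("%.1f", $nvalues > 0 ? 100 * $uniq / $nvalues : 0),
    };
}

# Three records: one with 3 authors, one without authors, one with 1 author
my $stats = field_stats(
    [3, 0, 1],
    ['Davis, Miles', 'Parker, Charly', 'Mingus, Charles', 'Davis, Miles'],
);
```

For this input one of three records lacks the field, so C<zeros%> is 33.3, and three of
the four values are distinct, so C<uniq%> is 75.0.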

=head1 CONFIGURATION

=over 4

=item v

Verbose output. Show the processing speed.

=item fix FIX

A fix or a fix file containing one or more fixes applied to the input data before
the statistics are calculated.

=item fields KEY[,KEY,...]

One or more fields in the data for which statistics need to be calculated. Deeply nested
fields are not allowed. The exporter collects statistics on the availability of a field in
all records. For instance, the following record contains one 'title' field, zero 'isbn'
fields and three 'author' fields:

    ---
    title: ABCDEF
    author:
        - Davis, Miles
        - Parker, Charly
        - Mingus, Charles
    year: 1950

Examples of operation:

    # Calculate statistics on the number of records that contain a 'title' field
    cat data.json | catmandu convert JSON to Stat --fields title

    # Calculate statistics on the number of records that contain a 'title', 'isbn' or 'subject' field
    cat data.json | catmandu convert JSON to Stat --fields title,isbn,subject

    # The next example will not work: no deeply nested fields allowed
    cat data.json | catmandu convert JSON to Stat --fields foo.bar.x.y

When no fields parameter is given, all fields are read from the first input record.
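
The per-record counting described above can be sketched as follows. This is an
illustration under stated assumptions, not the exporter's code: C<occurrences> is a
hypothetical helper, and it assumes that a missing field counts as zero, a list counts
once per element, and any other value counts as one.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Catmandu::Exporter::Stat.
# Returns how often a top-level field occurs in a record:
# 0 when missing, the list length for arrays, 1 for any other value.
sub occurrences {
    my ($record, $field) = @_;
    my $value = $record->{$field};
    return 0 unless defined $value;
    return scalar @$value if ref $value eq 'ARRAY';
    return 1;
}

# The example record from above as a Perl hash
my $record = {
    title  => 'ABCDEF',
    author => ['Davis, Miles', 'Parker, Charly', 'Mingus, Charles'],
    year   => 1950,
};

my %counts = map { ($_ => occurrences($record, $_)) } qw(title isbn author);
```

Only top-level keys are looked up here, which mirrors the restriction that deeply nested
fields such as C<foo.bar.x.y> are not supported.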

=item as Table | CSV | YAML | JSON | ...

By default the statistics are exported in a Table format. Use the 'as' option to change the
export format.


