Catmandu-Stat
lib/Catmandu/Exporter/Stat.pm
$stats->{mean} = $self->get_stat($key)->mean();
$stats->{variance} = sprintf "%.1f" , $self->get_stat($key)->variance();
$stats->{stdev} = sprintf "%.1f" , $self->get_stat($key)->standard_deviation();
my ($zeros,$zerosp,$occur_count,$values_count,$uniqs);
$zeros = $self->{res}->{$key}->{zero} // 0;
$values_count = $self->{res}->{$key}->{count};
$occur_count = $self->get_stat($key)->count();
$zerosp = sprintf "%.1f" , $occur_count > 0 ? 100 * $zeros / $occur_count : 100;
# Estimated percentage of unique values; computed once and reused below
my $uniq_pct = $values_count > 0 ? 100 * $self->get_key_uniq($key) / $values_count : 0.0;
$uniqs = sprintf "%.1f" , $uniq_pct;
# An estimate above 100% means the uniqueness estimator overshot
my $overflow = $uniq_pct > 100 ? 1 : 0;
$stats->{zeros} = $zeros;
$stats->{'zeros%'} = $zerosp;
$stats->{'uniq~'} = floor($self->get_key_uniq($key));
$stats->{'uniq%'} = $uniqs;
$stats->{'uniq%'} .= " (!)" if $overflow;
$stats->{'uniq~'} .= " (!)" if $overflow;
$stats->{entropy} = $self->entropy($key);
$stats->{entropy} .= " (!)" if $overflow;
$exporter->add($stats);
$has_overflow = 1 if $overflow;
}
$exporter->commit;
if ($has_overflow) {
print STDERR <<EOF;
Overflow warning - probably your dataset is too small for an accurate uniq~, uniq% and entropy count...
EOF
}
}
1;
=head1 NAME
Catmandu::Exporter::Stat - a statistical export
=head1 SYNOPSIS
# Calculate statistics on the availability of the ISBN fields in the dataset
cat data.json | catmandu convert -v JSON to Stat --fields isbn
# Export the statistics as YAML
cat data.json | catmandu convert -v JSON to Stat --fields isbn --as YAML
=head1 DESCRIPTION
The L<Catmandu::Stat> package can be used to calculate statistics on the availability of
fields in a data file. Use this exporter to count the availability of fields or count
the number of duplicate values. For each field the exporter calculates the following
statistics:
* name : the name of a field
* count : the number of occurrences of a field in all records
* zeros : the number of records without a field
* zeros% : the percentage of records without a field
* min : the minimum number of occurrences of a field in any record
* max : the maximum number of occurrences of a field in any record
* mean : the mean number of occurrences of a field in all records
* variance : the variance of the number of occurrences of a field per record
* stdev : the standard deviation of the number of occurrences of a field per record
* uniq~ : the estimated number of unique values
* uniq% : the estimated percentage of unique values
* entropy : the minimum and maximum entropy in the field values (estimated value)
Details:
* entropy is an indication of the variation in field values (are some values more unique than others)
* entropy values are displayed as : minimum/maximum entropy
* when the minimum entropy = 0, then all the field values are equal
* when the minimum and maximum entropy are equal, then all the field values are different
* the 'uniq%' and 'entropy' fields are estimated and are normally within 1% of the
correct value (this is done to keep the memory requirements of this module low)
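The entropy rules above can be illustrated with a small exact calculation. This is a minimal sketch, not the module's code: it assumes the "minimum" figure is the Shannon entropy of the observed values and the "maximum" is log2 of the number of values, which matches the two rules stated above. The helper name C<entropy_bounds> is hypothetical, and the exporter itself works with estimates rather than exact frequency counts.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Catmandu::Exporter::Stat.
# Returns (shannon_entropy, max_entropy) in bits for a list of values.
sub entropy_bounds {
    my @values = @_;
    my $n = scalar @values;
    return (0, 0) unless $n;

    # Exact frequency count per distinct value
    my %freq;
    $freq{$_}++ for @values;

    # Shannon entropy: H = -sum(p * log2(p))
    my $h = 0;
    for my $count (values %freq) {
        my $p = $count / $n;
        $h -= $p * log($p) / log(2);
    }

    # Upper bound, reached when every value is different
    my $max = log($n) / log(2);
    return ($h, $max);
}

for my $case ([qw(a a a a)], [qw(a b c d)], [qw(a a b c)]) {
    my ($h, $max) = entropy_bounds(@$case);
    printf "%s : %.1f/%.1f\n", join(',', @$case), $h, $max;
}
```

Under these assumptions, four equal values give 0.0/2.0, four different values give 2.0/2.0, and a mixed list such as a,a,b,c gives 1.5/2.0.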
Each statistical report contains one row named '#' (hash), which holds the total
number of records.
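To make the occurrence-based statistics concrete, the following sketch computes count, zeros, zeros%, min, max and mean for a few in-memory records. It is an illustration under assumptions, not the exporter's implementation: the helper names C<occurrences> and C<field_stats> are hypothetical, a scalar value is assumed to count as one occurrence, an array as one occurrence per element, and a missing field as zero.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Hypothetical helper: how often does $field occur in one record?
# Assumption: missing = 0, scalar = 1, array = number of elements.
sub occurrences {
    my ($record, $field) = @_;
    my $value = $record->{$field};
    return 0 unless defined $value;
    return ref $value eq 'ARRAY' ? scalar @$value : 1;
}

# Hypothetical helper: summary statistics for one top-level field.
sub field_stats {
    my ($records, $field) = @_;
    my @per_record = map { occurrences($_, $field) } @$records;
    my $n     = scalar @per_record;
    my $zeros = grep { $_ == 0 } @per_record;   # count in scalar context
    my $count = $n ? sum(@per_record) : 0;

    return {
        name     => $field,
        count    => $count,
        zeros    => $zeros,
        'zeros%' => sprintf("%.1f", $n ? 100 * $zeros / $n : 100),
        min      => $n ? min(@per_record) : 0,
        max      => $n ? max(@per_record) : 0,
        mean     => $n ? $count / $n : 0,
    };
}

my $records = [
    { title => 'ABCDEF', year => 1950,
      author => ['Davis, Miles', 'Parker, Charly', 'Mingus, Charles'] },
    { title => 'X', isbn => '123', author => ['A'] },
    { title => 'Y' },
];

for my $field (qw(title isbn author)) {
    my $s = field_stats($records, $field);
    printf "%-6s count=%d zeros=%d zeros%%=%s min=%d max=%d mean=%.2f\n",
        @{$s}{qw(name count zeros zeros% min max mean)};
}
```

In this sample, 'author' occurs 3, 1 and 0 times across the three records, so its count is 4, one record has no author (zeros% of 33.3), and the mean is about 1.33.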
=head1 CONFIGURATION
=over 4
=item v
Verbose output. Show the processing speed.
=item fix FIX
A fix or a fix file containing one or more fixes applied to the input data before
the statistics are calculated.
=item fields KEY[,KEY,...]
One or more fields in the data for which statistics need to be calculated. Deeply nested
fields are not allowed. The exporter will collect statistics on the availability of a field in
all records. For instance, the following record contains one 'title' field, zero 'isbn'
fields and three 'author' fields:
---
title: ABCDEF
author:
- Davis, Miles
- Parker, Charly
- Mingus, Charles
year: 1950
Examples of operation:
# Calculate statistics on the number of records that contain a 'title'
cat data.json | catmandu convert JSON to Stat --fields title
# Calculate statistics on the number of records that contain 'title', 'isbn' or 'subject' fields
cat data.json | catmandu convert JSON to Stat --fields title,isbn,subject
# The next example will not work: no deeply nested fields allowed
cat data.json | catmandu convert JSON to Stat --fields foo.bar.x.y
When no fields parameter is given, the field names are taken from the first input record.
=item as Table | CSV | YAML | JSON | ...
By default the statistics are exported in Table format. Use the 'as' option to change the
export format.