ApacheLog-Compressor
lib/ApacheLog/Compressor.pm
use ApacheLog::Compressor;
use IO::Compress::Bzip2;
use Sys::Hostname qw(hostname);
# Write all data to bzip2-compressed output file
open my $out_fh, '>', 'compressed.log.bz2' or die "Failed to create output file: $!";
binmode $out_fh;
my $zip = IO::Compress::Bzip2->new($out_fh, BlockSize100K => 9);
# Provide a callback to send data through to the file
my $alc = ApacheLog::Compressor->new(
    on_write => sub {
        my ($self, $pkt) = @_;
        $zip->write($pkt);
    }
);
# Input file - normally use whichever one's just been closed + rotated
open my $fh, '<', '/var/log/apache2/access.log.1' or die "Failed to open log: $!";
# Initial packet to identify which server this came from
$alc->send_packet('server',
    hostname => hostname(),
);
# Read and compress all the lines in the file
while(my $line = <$fh>) {
    $alc->compress($line);
}
close $fh or die $!;
$zip->close;
# Dump the stats in case anyone finds them useful
$alc->stats;
=head1 DESCRIPTION
Converts data from the standard Apache log format into a binary stream that is typically 20% to 60% of the size of the original file.
Intended for cases where log data needs to be transferred from multiple high-volume servers for analysis (potentially in real time
via C<tail -f>).
The compressed format uses simple dictionary replacement: each field that cannot be represented in a fixed-width datatype
is replaced with an indexed value. This keeps the basic log line packet at a fixed size, with additional packets carrying the
first instance of each variable-width data item.
Example:
api.example.com 105327 123.15.16.108 - apiuser@example.com [19/Dec/2009:03:12:07 +0000] "POST /api/status.json HTTP/1.1" 200 80516 "-" "-" "-"
The duration, IP, timestamp, method, HTTP version, response and size can all be stored as 32-bit quantities (or smaller), without losing
any information. The vhost, user and URL are extracted to separate packets, since we expect to see them at least twice on a typical server.
This would be converted to:
=over 4
=item * vhost packet - api.example.com assigned index 0
=item * user packet - apiuser@example.com assigned index 0
=item * url packet - /api/status.json assigned index 0
=item * timestamp packet - a busy server is likely to handle several requests per second, so there is a small saving in sending the timestamp only when its value changes. It therefore goes into a separate packet as well.
=item * log packet - actual data, binary encoded.
=back
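The dictionary replacement in the list above can be sketched in plain Perl. This is only an illustration of the idea, not the module's implementation: C<index_of>, C<%index_for> and C<@packets> are hypothetical names, and the packets are kept as array references rather than binary data.

```perl
use strict;
use warnings;

my %index_for;    # type => { value => index }
my @packets;      # dictionary packets, in the order they would be sent

# Hypothetical helper: return the index for a value, recording a
# dictionary packet the first time that value is seen for its type.
sub index_of {
    my ($type, $value) = @_;
    my $seen = $index_for{$type} ||= {};
    unless (exists $seen->{$value}) {
        my $next = keys %$seen;    # next free index, starting from 0
        $seen->{$value} = $next;
        push @packets, [ $type, $next, $value ];
    }
    return $seen->{$value};
}

# The variable-width fields from the example line above
my @entry = (
    index_of(vhost => 'api.example.com'),
    index_of(user  => 'apiuser@example.com'),
    index_of(url   => '/api/status.json'),
);

# A repeated value reuses its index and produces no new packet
my $again = index_of(url => '/api/status.json');

printf "%s packet: index %d = %s\n", @$_ for @packets;
```

Only the first request for a given vhost, user or URL pays for the full string. Every later log packet carries the fixed-width index instead.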
The following packet types are available:
=over 4
=item * 00 - Log entry
=item * 01 - Change server
=item * 02 - timestamp
=item * 03 - vhost
=item * 04 - user
=item * 05 - useragent
=item * 06 - referer
=item * 07 - url
=item * 80 - reset
=back
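A lookup table is one simple way to work with these codes when inspecting a stream. The sketch below assumes the codes listed above are hexadecimal byte values, which the list does not state. The names are taken from the list, and C<%PACKET_NAME> and C<%PACKET_ID> are hypothetical.

```perl
use strict;
use warnings;

# Assumption: the two-digit codes above are hexadecimal byte values.
my %PACKET_NAME = (
    0x00 => 'log',
    0x01 => 'server',
    0x02 => 'timestamp',
    0x03 => 'vhost',
    0x04 => 'user',
    0x05 => 'useragent',
    0x06 => 'referer',
    0x07 => 'url',
    0x80 => 'reset',
);

# Reverse mapping, from name to numeric code
my %PACKET_ID = reverse %PACKET_NAME;

printf "%02x - %s\n", $_, $PACKET_NAME{$_}
    for sort { $a <=> $b } keys %PACKET_NAME;
```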
The log entry itself normally consists of the following fields:
N vhost
N time
N IP
N user
N useragent
N timestamp
C method
C version
n response
N bytes
N url
The format of the log file can be customised; see the next section for details.
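To see why the default log entry is fixed size, the field list above can be turned directly into a C<pack> template. This is a rough sketch rather than the module's own encoding: the index values and the method and version codes are made up, and treating C<time> as the request duration is an assumption.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Field names and pack types in the order listed above
my @fields = (
    [ vhost     => 'N' ], [ time      => 'N' ], [ IP       => 'N' ],
    [ user      => 'N' ], [ useragent => 'N' ], [ timestamp => 'N' ],
    [ method    => 'C' ], [ version   => 'C' ], [ response => 'n' ],
    [ bytes     => 'N' ], [ url       => 'N' ],
);
my $template = join '', map { $_->[1] } @fields;    # NNNNNNCCnNN

# Illustrative values only: indices and method/version codes are invented
my %entry = (
    vhost     => 0,
    time      => 105327,
    IP        => unpack('N', inet_aton('123.15.16.108')),
    user      => 0,
    useragent => 0,
    timestamp => 0,
    method    => 1,
    version   => 1,
    response  => 200,
    bytes     => 80516,
    url       => 0,
);

my $packed = pack $template, map { $entry{ $_->[0] } } @fields;
printf "%d bytes\n", length $packed;    # 36 bytes

# Decoding reverses the process using the same template
my %decoded;
@decoded{ map { $_->[0] } @fields } = unpack $template, $packed;
print inet_ntoa(pack 'N', $decoded{IP}), "\n";
```

Six 32-bit fields, two single bytes, one 16-bit field and two more 32-bit fields add up to 36 bytes per entry, regardless of how long the original vhost, user or URL strings were.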
=head3 FORMAT SPECIFICATION
A custom format can be provided as the C<format> parameter when instantiating
a new L<ApacheLog::Compressor> object via ->L</new>. This format consists of an
arrayref of key/value pairs, each value holding the following information:
=over 4
=item * id - the ID to use when sending packets
=item * type - L<perlfunc/pack> format specifier used when storing and retrieving the data, such as N1 or n1. Without this, there will be no entry for the item in the compressed log stream.
=item * regex - the regular expression used for matching this part of the log file. The
final regex will be the concatenation of all regex entries for the format, joined
using \s+ as the delimiter.
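The concatenation can be illustrated without the module itself. In this minimal sketch the three-field format, its patterns, and the hashref shape of each value are assumptions made for the example, not the module's default format.

```perl
use strict;
use warnings;

# Hypothetical format: key/value pairs, each value holding type and regex
my @format = (
    vhost    => { type => 'N1', regex => qr{(\S+)} },
    duration => { type => 'N1', regex => qr{(\d+)} },
    ip       => { type => 'N1', regex => qr{(\d+\.\d+\.\d+\.\d+)} },
);

# Walk the pairs in order, collecting field names and patterns
my (@names, @patterns);
for (my $i = 0; $i < @format; $i += 2) {
    push @names,    $format[$i];
    push @patterns, $format[ $i + 1 ]{regex};
}

# Concatenate the per-field patterns using \s+ as the delimiter
my $joined  = join '\s+', @patterns;
my $line_re = qr{^$joined};

# Each capture group lines up with the field name at the same position
my %field;
@field{@names} = 'api.example.com 105327 123.15.16.108 - apiuser@example.com' =~ $line_re;

print "$_ = $field{$_}\n" for @names;
```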