ApacheLog-Compressor
view release on metacpan or search on metacpan
lib/ApacheLog/Compressor.pm view on Meta::CPAN
package ApacheLog::Compressor;
# ABSTRACT: Convert Apache/CLF data to binary format
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);
use Date::Parse qw(str2time);
use List::Util qw(min);
use URI;
use URI::Escape qw(uri_unescape);
use DateTime;
use Encode qw(encode_utf8 decode_utf8 FB_DEFAULT is_utf8 FB_CROAK);
use POSIX qw{strftime};
our $VERSION = '0.005';
=head1 NAME
ApacheLog::Compressor - convert Apache / CLF log files into a binary format for transfer
=head1 VERSION
version 0.005
=head1 SYNOPSIS
use ApacheLog::Compressor;
use Sys::Hostname qw(hostname);
# Write all data to bzip2-compressed output file
open my $out_fh, '>', 'compressed.log.bz2' or die "Failed to create output file: $!";
binmode $out_fh;
my $zip = IO::Compress::Bzip2->new($out_fh, BlockSize100K => 9);
# Provide a callback to send data through to the file
my $alc = ApacheLog::Compressor->new(
on_write => sub {
my ($self, $pkt) = @_;
$zip->write($pkt);
}
);
# Input file - normally use whichever one's just been closed + rotated
open my $fh, '<', '/var/log/apache2/access.log.1' or die "Failed to open log: $!";
# Initial packet to identify which server this came from
$alc->send_packet('server',
hostname => hostname(),
);
# Read and compress all the lines in the files
while(my $line = <$fh>) {
$alc->compress($line);
}
close $fh or die $!;
$zip->close;
# Dump the stats in case anyone finds them useful
$alc->stats;
=head1 DESCRIPTION
Converts data from standard Apache log format into a binary stream which is typically 20% - 60% the size of the original file.
Intended for cases where log data needs transferring from multiple high-volume servers for analysis (potentially in realtime
via tail -f).
The log format is a simple dictionary replacement algorithm: each field that cannot be represented in a fixed-width datatype
is replaced with an indexed value, allowing the basic log line packet to be fixed size with additional packets containing the
first instance of each variable-width data item.
Example:
api.example.com 105327 123.15.16.108 - apiuser@example.com [19/Dec/2009:03:12:07 +0000] "POST /api/status.json HTTP/1.1" 200 80516 "-" "-" "-"
The duration, IP, timestamp, method, HTTP version, response and size can all be stored as 32-bit quantities (or smaller), without losing
any information. The vhost, user and URL are extracted to separate packets, since we expect to see them at least twice on a typical server.
This would be converted to:
=over 4
=item * vhost packet - api.example.com assigned index 0
=item * user packet - apiuser@example.com assigned index 0
=item * url packet - /api/status.json assigned index 0
=item * timestamp packet - since a busy server is likely to have several requests a second, there's a tiny saving to be had by sending this only when the value changes, so we push this into a separate packet as well.
=item * log packet - actual data, binary encoded.
=back
( run in 1.096 second using v1.01-cache-2.11-cpan-39bf76dae61 )