ApacheLog-Compressor

 view release on metacpan or  search on metacpan

lib/ApacheLog/Compressor.pm  view on Meta::CPAN

 use Sys::Hostname qw(hostname);

 # Write all data to bzip2-compressed output file
 open my $out_fh, '>', 'compressed.log.bz2' or die "Failed to create output file: $!";
 binmode $out_fh;
 my $zip = IO::Compress::Bzip2->new($out_fh, BlockSize100K => 9);

 # Provide a callback to send data through to the file
 my $alc = ApacheLog::Compressor->new(
	on_write	=> sub {
		my ($self, $pkt) = @_;
		$zip->write($pkt);
	}
 );

 # Input file - normally use whichever one's just been closed + rotated
 open my $fh, '<', '/var/log/apache2/access.log.1' or die "Failed to open log: $!";

 # Initial packet to identify which server this came from
 $alc->send_packet('server',
 	hostname	=> hostname(),
 );

 # Read and compress all the lines in the files
 while(my $line = <$fh>) {
	 $alc->compress($line);
 }
 close $fh or die $!;
 $zip->close;

 # Dump the stats in case anyone finds them useful
 $alc->stats;

=head1 DESCRIPTION

Converts data from standard Apache log format into a binary stream which is typically 20% - 60% the size of the original file.
Intended for cases where log data needs transferring from multiple high-volume servers for analysis (potentially in realtime
via tail -f).

The log format is a simple dictionary replacement algorithm: each field that cannot be represented in a fixed-width datatype
is replaced with an indexed value, allowing the basic log line packet to be fixed size with additional packets containing the
first instance of each variable-width data item.

Example:

 api.example.com 105327 123.15.16.108 - apiuser@example.com [19/Dec/2009:03:12:07 +0000] "POST /api/status.json HTTP/1.1" 200 80516 "-" "-" "-"

The duration, IP, timestamp, method, HTTP version, response and size can all be stored as 32-bit quantities (or smaller), without losing
any information. The vhost, user and URL are extracted to separate packets, since we expect to see them at least twice on a typical server.

This would be converted to:

=over 4

=item * vhost packet - api.example.com assigned index 0

=item * user packet - apiuser@example.com assigned index 0

=item * url packet - /api/status.json assigned index 0

=item * timestamp packet - since a busy server is likely to have several requests a second, there's a tiny saving to be had by sending this only when the value changes, so we push this into a separate packet as well.

=item * log packet - actual data, binary encoded.

=back

The following packet types are available:

=over 4

=item * 00 - Log entry

=item * 01 - Change server

=item * 02 - timestamp

=item * 03 - vhost

=item * 04 - user

=item * 05 - useragent

=item * 06 - referer

=item * 07 - url

=item * 80 - reset

=back

The log entry itself normally consists of the following fields:

 N vhost
 N time
 N IP
 N user
 N useragent
 N timestamp
 C method
 C version
 n response
 N bytes
 N url

The format of the log file can be customised, see the next section for details.

=head3 FORMAT SPECIFICATION

A custom format can be provided as the C<format> parameter when instantiating
a new L<ApacheLog::Compressor> object via ->L</new>. This format consists of an
arrayref of key/value pairs, each value holding the following information:

=over 4

=item * id - the ID to use when sending packets

=item * type - L<pack> format specifier used when storing and retrieving the data, such as N1 or n1. Without this there will be no entry for the item in the compressed log stream

=item * regex - the regular expression used for matching this part of the log file. The
final regex will be the concatenation of all regex entries for the format, joined
using \s+ as the delimiter.



( run in 1.400 second using v1.01-cache-2.11-cpan-39bf76dae61 )