DBSchema-Normalizer

 view release on metacpan or  search on metacpan

Normalizer.pm  view on Meta::CPAN

	artist_id INT(11) NOT NULL, 
	PRIMARY KEY (album_id), 
	KEY genre (genre), KEY album (album), KEY artist_id (artist_id));
	
	INSERT INTO albums 
	SELECT src.album_id, src.album, src.genre, artist_id 
	FROM tmp_albums src 
	INNER JOIN artists lkp ON (src.artist =lkp.artist);

 mysql> describe artists;
 +-----------+-------------+------+-----+---------+----------------+
 | Field     | Type        | Null | Key | Default | Extra          |
 +-----------+-------------+------+-----+---------+----------------+
 | artist_id | int(11)     |      | PRI | NULL    | auto_increment |
 | artist    | varchar(20) |      | MUL |         |                |
 +-----------+-------------+------+-----+---------+----------------+
 2 rows in set (0.00 sec)
 
 mysql> describe albums;
 +-----------+-------------+------+-----+---------+----------------+
 | Field     | Type        | Null | Key | Default | Extra          |
 +-----------+-------------+------+-----+---------+----------------+
 | album_id  | int(11)     |      | PRI | NULL    | auto_increment |
 | album     | varchar(30) |      | MUL |         |                |
 | genre     | varchar(10) |      | MUL |         |                |
 | artist_id | int(11)     |      | MUL | 0       |                |
 +-----------+-------------+------+-----+---------+----------------+
 4 rows in set (0.00 sec)
 
 mysql> describe songs;
 +----------+-------------+------+-----+----------+----------------+
 | Field    | Type        | Null | Key | Default  | Extra          |
 +----------+-------------+------+-----+----------+----------------+
 | ID       | int(11)     |      | PRI | NULL     | auto_increment |
 | title    | varchar(40) |      | MUL |          |                |
 | duration | time        |      |     | 00:00:00 |                |
 | size     | int(11)     |      |     | 0        |                |
 | album_id | int(11)     |      | MUL | 0        |                |
 +----------+-------------+------+-----+----------+----------------+
 5 rows in set (0.00 sec)
 
It should be clear now WHAT we have to do. Less clear is HOW. The above instructions seem to imply that we manually copy the field structure from the source table to the lookup and normalized tables.

Actually, that SQL code (except the DESCRIBEs) was produced by this very module and printed to the STDOUT, so that all I had to do was some cut-and-paste.
And then we are back to the question of the algorithm. If this is all SQL, where is Perl involved?
The answer is that Perl will reduce the amount of information we need to give to the database engine.
The information about the field structure and indexes is already in the database. Our Perl module (with a [not so] little help from the DBI) can extract the structure from the database and create the appropriate SQL statements.
On the practical side, this means that, before producing SQL code, this module will gather information about the source table. It will issue a "SHOW FIELDS FROM tablename" and a "SHOW INDEX FROM tablename" statements, and parse their results to prepa...

That's it. It seems a rather small contribution to your average workload, but if you ever have to deal with a project involving several large tables, with many fields, to be transformed into many normalized tables, I am sure you will appreciate the G...

BTW, this is the code used to produce the above SQL statements:

	#!/usr/bin/perl -w
	use strict;

	use DBSchema::Normalizer;

	my $norm = DBSchema::Normalizer->new ({
		DSN  => "DBI:mysql:music;host=localhost;"
			 . "mysql_read_default_file=$ENV{HOME}/.my.cnf", 
	  	src_table     => "MP3",
	  	index_field   => "album_id",
	  	lookup_fields => "artist,album,genre",
	  	lookup_table  => "tmp_albums", 
		dest_table    => "songs",
		copy_indexes  =>  1,
		simulate      =>  1
	 });

	$norm->do();

	$norm->create_lookup_table ({ 
	  src_table     => "tmp_albums",
	  index_field   => "artist_id",
	  lookup_fields => "artist",
	  lookup_table  => "artists"
  	});

	$norm->create_normalized_table ({
	  src_table     => "tmp_albums",
	  lookup_table  => "artists",
	  index_field   => "artist_id",
	  lookup_fields => "artist",
	  dest_table    => "albums"
	});

Twenty-five lines of code. Not bad for such a complicated task. But even that could have been replaced by these two one-liners:

 perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music MP3 \ 
       album_id album,artist,genre tmp_albums songs 1 1 1))->do()'
	
 perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music \ 
    tmp_albums artist_id artist artists albums 1 1 1))->do()'

(See below item "snew" for more details.)
	
One thing that this module won't do for you, though, is to decide which columns should stay with the source table and which ones should go to the lookup table. This is something for which you need to apply some database theory, and I don't expect you...
I am planning (in a very idle and distant way) another module that will analyze a database table and decide if it is a good design or not. The examples from commercial software I have seen so far did not impress me a lot. I am still convinced that hu...

=head2 Simulation mode

This module can do the data transfer for you, or you may want to run it in "simulation mode", by adding simulate => "1" to the constructor parameters. When in simulation mode, the DBSchema::Normalizer will just print the necessary SQL statements to S...

=head2 EXPORT

new, snew, create_lookup_table, create_normalized_table

=head2 DEPENDENCIES

DBI, DBD::mysql

=head2 Architecture

The Normalizer doesn't enforce private data protection. You are only supposed to call the methods which are documented here as public. In the spirit of Perl OO philosophy, nobody will prevent you from calling private methods (the ones beginning with ...

=head2 PORTABILITY

The algorithm used here is general. It was initially developed in C for an embedded RDBMS and there is no reason to assume that it won't work in any other database engine. However, the information about field types and indexes is, in this module, MyS...
To adapt this module for a different database, corresponding SQL statements for the MYSQL specific SHOW INDEX and SHOW FIELDS should be provided. Also the syntax for INNER JOIN might not be portable across databases.

Normalizer.pm  view on Meta::CPAN

        $_dbh= DBI->connect($params->{DSN}, $params->{username}, 
		$params->{password}, { RaiseError => 1});
    }
	my $self = bless {
		verbose         => 0,
		copy_indexes    => 0,
		simulate        => 0,
		ignore_warning  => 0,
		_dbh            => $_dbh # Being an object, $_dbh is already 
		                         # a reference. Doesn't need the "\" 
								 # before it.
	}, $class;
	foreach my $key (keys (%$params)) {
		croak "invalid parameter $key \n" unless exists $_accepted_params{$key}; 
		$self->{$key} = $params->{$key};
	}
	if ($self->{simulate} eq "1") {
		$self->{verbose} = "2";
	}
	elsif ($self->{simulate} ne "0") {
		croak "invalid value for <simulate>\n";
	}
	return ($self);
}

=item snew

snew is a shortcut for new(). It is called with parameters passed by position instead of using a hash reference.
It is a "quick-and-dirty" ("dirty" being the operational word) method intended for the impatient who does not want to write a script.
B<Assumes that you have a configuration file for MySQL with username and password>.
Parameters are passed in this order:

	host
	database
	source table
	index field
	lookup fields
	lookup table
	destination table
	copy indexes
	verbose
	simulate
	
Here is an example of one-liner normalization call:
 
 perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music MP3 \
    album_id album,artist,genre tmp_albums songs 1 1 1))->do()'

Note: ALL 11 parameters must be passed, or an "use of uninitialized value" error is issued.

This one-liner is equivalent to the following script:
 
	#!/usr/bin/perl 
	no warnings; # Yeah. No warnings. I said it is equivalent, 
				 # not recommended.
	no strict;   # Yup. No strict either. 
	use DBSchema::Normalizer;
	$norm = DBSchema::Normalizer->new ( 
	{
		DSN => "DBI:mysql:music;host=localhost;"
			. "mysql_read_default_file=$ENV{HOME}/.my.cnf", 
	  	src_table     => "MP3",
	  	index_field   => "album_id",
	  	lookup_fields => "artist,album,genre",
	  	lookup_table  => "tmp_albums", 
		dest_table    => "songs",
		copy_indexes  =>  1,
		verbose       =>  1,
		simulate      =>  1,
	 });
	$norm->do();

It is definitely not as safe as the normal call. However, TMTOWTDI, and it's your call. I am using it, but I don't recommend it. Read my lips: I DON'T RECOMMEND IT.

=cut

sub snew { # shortcut new (parameters called by position)
	my ($class, $host, $db, $src_table, $index_field, 
		$lookup_fields,	$lookup_table, $dest_table, 
		$copy_indexes, $verbose, $simulate) = @_;
	my $DSN= "DBI:mysql:$db;host=$host;"
		. "mysql_read_default_file=$ENV{HOME}/.my.cnf";
	return new ($class, {
		DSN           => $DSN,
	  	src_table     => $src_table,
	  	index_field   => $index_field,
	  	lookup_fields => $lookup_fields,
	  	lookup_table  => $lookup_table, 
		dest_table    => $dest_table,
		copy_indexes  => $copy_indexes,
		verbose       => $verbose,
		simulate      => $simulate,
	 });
}

=for internal use
(The Destroyer will clean-up DBI objects.)

=cut

# use Data::Dumper;
sub DESTROY {
	my $self = shift;
	# print STDERR Data::Dumper->Dump([$self],["InDestroyer"]);
	$self->{_dbh}->disconnect();
	undef $self->{_dbh};
}

=item do

do();

do() performs the Normalization according to the parameters already received. Internally calls I<create_lookup_table()> and I<create_normalized_table()>

Will fail if not enough parameters have been supplied to new()

=cut

sub do {
	my $self = shift;
	return 0 unless $self->_init();
	$self->create_lookup_table();
	$self->create_normalized_table();
	return 1;
}

=for internal use
(Checks that given keys in internal blessed hash are defined)

=cut

sub _init_field {
	my $self = shift;
	my @fields = @_;
	my $def = 1;
	foreach (@fields) {
		if (!defined $self->{$_}) {
			$self->_verbose("0", "missing $_\n");
			return 0;
		}
	}
	return 1;



( run in 1.242 second using v1.01-cache-2.11-cpan-39bf76dae61 )