DBSchema-Normalizer

 view release on metacpan or  search on metacpan

docs/Normalizer.pod  view on Meta::CPAN

	artist_id INT(11) NOT NULL, 
	PRIMARY KEY (album_id), 
	KEY genre (genre), KEY album (album), KEY artist_id (artist_id));
	
	INSERT INTO albums 
	SELECT src.album_id, src.album, src.genre, artist_id 
	FROM tmp_albums src 
	INNER JOIN artists lkp ON (src.artist =lkp.artist);

 mysql> describe artists;
 +-----------+-------------+------+-----+---------+----------------+
 | Field     | Type        | Null | Key | Default | Extra          |
 +-----------+-------------+------+-----+---------+----------------+
 | artist_id | int(11)     |      | PRI | NULL    | auto_increment |
 | artist    | varchar(20) |      | MUL |         |                |
 +-----------+-------------+------+-----+---------+----------------+
 2 rows in set (0.00 sec)
 
 mysql> describe albums;
 +-----------+-------------+------+-----+---------+----------------+
 | Field     | Type        | Null | Key | Default | Extra          |
 +-----------+-------------+------+-----+---------+----------------+
 | album_id  | int(11)     |      | PRI | NULL    | auto_increment |
 | album     | varchar(30) |      | MUL |         |                |
 | genre     | varchar(10) |      | MUL |         |                |
 | artist_id | int(11)     |      | MUL | 0       |                |
 +-----------+-------------+------+-----+---------+----------------+
 4 rows in set (0.00 sec)
 
 mysql> describe songs;
 +----------+-------------+------+-----+----------+----------------+
 | Field    | Type        | Null | Key | Default  | Extra          |
 +----------+-------------+------+-----+----------+----------------+
 | ID       | int(11)     |      | PRI | NULL     | auto_increment |
 | title    | varchar(40) |      | MUL |          |                |
 | duration | time        |      |     | 00:00:00 |                |
 | size     | int(11)     |      |     | 0        |                |
 | album_id | int(11)     |      | MUL | 0        |                |
 +----------+-------------+------+-----+----------+----------------+
 5 rows in set (0.00 sec)
 
It should be clear now WHAT we have to do. Less clear is HOW. The above instructions seem to imply that we manually copy the field structure from the source table to the lookup and normalized tables.

Actually, that SQL code (except the DESCRIBEs) was produced by this very module and printed to the STDOUT, so that all I had to do was some cut-and-paste.
And then we are back to the question of the algorithm. If this is all SQL, where is Perl involved?
The answer is that Perl will reduce the amount of information we need to give to the database engine.
The information about the field structure and indexes is already in the database. Our Perl module (with a [not so] little help from the DBI) can extract the structure from the database and create the appropriate SQL statements.
On the practical side, this means that, before producing SQL code, this module will gather information about the source table. It will issue a "SHOW FIELDS FROM tablename" and a "SHOW INDEX FROM tablename" statements, and parse their results to prepa...

That's it. It seems a rather small contribution to your average workload, but if you ever have to deal with a project involving several large tables, with many fields, to be transformed into many normalized tables, I am sure you will appreciate the G...

BTW, this is the code used to produce the above SQL statements:

	#!/usr/bin/perl -w
	use strict;

	use DBSchema::Normalizer;

	my $norm = DBSchema::Normalizer->new ({
		DSN  => "DBI:mysql:music;host=localhost;"
			 . "mysql_read_default_file=$ENV{HOME}/.my.cnf", 
	  	src_table     => "MP3",
	  	index_field   => "album_id",
	  	lookup_fields => "artist,album,genre",
	  	lookup_table  => "tmp_albums", 
		dest_table    => "songs",
		copy_indexes  =>  1,
		simulate      =>  1
	 });

	$norm->do();

	$norm->create_lookup_table ({ 
	  src_table     => "tmp_albums",
	  index_field   => "artist_id",
	  lookup_fields => "artist",
	  lookup_table  => "artists"
  	});

	$norm->create_normalized_table ({
	  src_table     => "tmp_albums",
	  lookup_table  => "artists",
	  index_field   => "artist_id",
	  lookup_fields => "artist",
	  dest_table    => "albums"
	});

Twenty-five lines of code. Not bad for such a complicated task. But even that could have been replaced by these two one-liners:

 perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music MP3 \ 
       album_id album,artist,genre tmp_albums songs 1 1 1))->do()'
	
 perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music \ 
    tmp_albums artist_id artist artists albums 1 1 1))->do()'

(See below item "snew" for more details.)
	
One thing that this module won't do for you, though, is to decide which columns should stay with the source table and which ones should go to the lookup table. This is something for which you need to apply some database theory, and I don't expect you...
I am planning (in a very idle and distant way) another module that will analyze a database table and decide if it is a good design or not. The examples from commercial software I have seen so far did not impress me a lot. I am still convinced that hu...

=head2 Simulation mode

This module can do the data transfer for you, or you may want to run it in "simulation mode", by adding simulate => "1" to the constructor parameters. When in simulation mode, the DBSchema::Normalizer will just print the necessary SQL statements to S...

=head2 EXPORT

new, snew, create_lookup_table, create_normalized_table

=head2 DEPENDENCIES

DBI, DBD::mysql

=head2 Architecture

The Normalizer doesn't enforce private data protection. You are only supposed to call the methods which are documented here as public. In the spirit of Perl OO philosophy, nobody will prevent you from calling private methods (the ones beginning with ...

=head2 PORTABILITY

The algorithm used here is general. It was initially developed in C for an embedded RDBMS and there is no reason to assume that it won't work in any other database engine. However, the information about field types and indexes is, in this module, MyS...
To adapt this module for a different database, corresponding SQL statements for the MYSQL specific SHOW INDEX and SHOW FIELDS should be provided. Also the syntax for INNER JOIN might not be portable across databases.

docs/Normalizer.pod  view on Meta::CPAN

	                "yes" or "1"   : indexes from the source table will
	                                 be immediately replicated to the 
	                                 destination table
	                "later" or "2" : indexes will be created after the 
					                 data transfer,
	                                 as an ALTER TABLE statement. 
	                                 It may speed up the insertion
	                                 for large tables.
	verbose         if "1", messages indicating what is going on
	                will be sent to STDOUT. 
	                Using "2", even more verbose information is 
	                given (all queries printed before execution);
	                Level "3" will also show details about src_table 
	                fields and indexes;
	ignore_warning  if "1", warning on missing indexes on lookup fields 
	                are ignored, and the requested operation carried 
					out even at a price of long waiting. Default "0"
	simulate        if "1", no operation is carried out
	                but the queries are printed to STDOUT (as in 
					verbose => 2)

B<note>: src_table, dest_table and lookup_table B<cannot> be called I<src> or I<lkp>, which are used internally by the Normalizer. If such names are given, a fatal error is issued.

If the keys for src_table, index_field, lookup table and fields are missing, they can (actually they MUST) be later provided by calls to create_lookup_table() and create_normalized_table().

=item snew

snew is a shortcut for new(). It is called with parameters passed by position instead of using a hash reference.
It is a "quick-and-dirty" ("dirty" being the operational word) method intended for the impatient who does not want to write a script.
B<Assumes that you have a configuration file for MySQL with username and password>.
Parameters are passed in this order:

	host
	database
	source table
	index field
	lookup fields
	lookup table
	destination table
	copy indexes
	verbose
	simulate
	
Here is an example of one-liner normalization call:
 
 perl -e 'use DBSchema::Normalizer; DBSchema::Normalizer->snew(qw(localhost music MP3 \
    album_id album,artist,genre tmp_albums songs 1 1 1))->do()'

Note: ALL 11 parameters must be passed, or an "use of uninitialized value" error is issued.

This one-liner is equivalent to the following script:
 
	#!/usr/bin/perl 
	no warnings; # Yeah. No warnings. I said it is equivalent, 
				 # not recommended.
	no strict;   # Yup. No strict either. 
	use DBSchema::Normalizer;
	$norm = DBSchema::Normalizer->new ( 
	{
		DSN => "DBI:mysql:music;host=localhost;"
			. "mysql_read_default_file=$ENV{HOME}/.my.cnf", 
	  	src_table     => "MP3",
	  	index_field   => "album_id",
	  	lookup_fields => "artist,album,genre",
	  	lookup_table  => "tmp_albums", 
		dest_table    => "songs",
		copy_indexes  =>  1,
		verbose       =>  1,
		simulate      =>  1,
	 });
	$norm->do();

It is definitely not as safe as the normal call. However, TMTOWTDI, and it's your call. I am using it, but I don't recommend it. Read my lips: I DON'T RECOMMEND IT.

=for internal use
(The Destroyer will clean-up DBI objects.)

=item do

do();

do() performs the Normalization according to the parameters already received. Internally calls I<create_lookup_table()> and I<create_normalized_table()>

Will fail if not enough parameters have been supplied to new()

=for internal use
(Checks that given keys in internal blessed hash are defined)

=for internal use
(_verbose() will print a message, depending on the currently set verbose level)

=for internal use
(_get_indexes() will find the indexes from src_table and set the internal values _primary_key_fields and _normal_fields_indexes with DML instructions to re-create the indexes within a SQL statement - It will identify multiple and unique keys)

=for internal use
(_get_field_descriptions() will extract data definition from src_table and prepare apt statements to re-create the needed fields in dest_table and lookup_table)

=for internal use
(_init() will clean the description fields ane fill them with appropriate calls to _get_field_descriptions() and _get_indexes())
Uncommenting the lines mentioning Data::Dumper will produce useful debug information.

=for internal use
(gets additional parameters into internal hash)

=item create_normalized_table()

create_normalized_table() will create a 2nd normal form table, getting data from a source table and a lookup table.
Lookup fields in the source table will be replaced by the corresponding index field in the lookup table.

If called without parameters, assumes the parameters passed to the object constructor.

Parameters are passed as a hash reference, and are the same given to new() except I<DSN>, I<username> and I<password>. None are compulsory here. The missing ones are taken from the constructor. However, a check is done to ensure that all parameters a...

=item create_lookup_table

create_lookup_table() will create a lookup table, extracting repeating fields from a 1st normal form table. A numeric primary key is created.

When called without parameters, assumes the values passed to the object constructor (I<new>).

Parameters are passed as a hash reference, and should include the following



( run in 0.619 second using v1.01-cache-2.11-cpan-39bf76dae61 )