DBSchema-Normalizer

 view release on metacpan or  search on metacpan

docs/Normalizer.pod  view on Meta::CPAN

	    username      => $username,
	    password      => $password,
	    src_table     => $sourcetable,
	    index_field   => $indexfield,
	    lookup_fields => $lookupfields, # comma separated list
	    lookup_table  => $lookuptable,
	    dest_table    => $dest_table,
	    copy_indexes  => "yes", 
	});
  $norm->do();  # Just Do It!

  # Alternatively, you can have some more control, by
  # creating the lookup table and normalized table separately,
  # especially useful if one of them is an intermediate step.
  #
  use DBSchema::Normalizer qw(create_lookup_table create_normalized_table);
  my $norm = DBSchema::Normalizer->new( 
	{ 
	    DSN      => $DSN,
	    username => $username,
	    password => $password
	});
  $norm->create_lookup_table (
	  {
	     src_table     => $tablename,
	     index_field   => $indexfield,
	     lookup_fields => $lookupfields,
	     lookup_table  => $lookuptable
      });
  $norm->create_normalized_table (
	  {
	     src_table     => $tablename,
	     index_field   => $indexfield,
	     lookup_fields => $lookupfields,
	     lookup_table  => $lookuptable,
	     dest_table    => $dest_table,
	     copy_indexes  => "yes",
      });

=head1 DESCRIPTION

B<DBSchema::Normalizer> is a module to help transforming MySQL database tables from 1st to 2nd normal form.
Simply put, it will create a lookup table out of a set of repeating fields from a source table, and replace such fields by a foreign key that points to the corresponding fields in the newly created table.
All information is taken from the database itself. There is no need to specify existing details. 
The module is capable of re-creating existing indexes, and should deal with complex cases where the replaced fields are part of a primary key.

=head2 Algorithm

The concept behind B<DBSchema::Normalizer> is based upon some DBMS properties. To replace repeating fields with a foreign key pointing to a lookup table, you must be sure that for each distinct set of values you have a distinct foreign key. You might...

	 I. Read all records into memory
	II. for each record, identify the unique value for the fields to be
	    moved into a lookup table and store it in a hash
	II. again, for each record, find the corresponding value in the 
	    previously saved hash and save the non-lookup fields plus the 
		unique key into a new table
	IV. for each key in the hash, write the values to a lookup table

I can find four objections against such attempt:

1. Memory problems. The source table can be very large (and some of the table I had to normalize were indeed huge. This kind of solution would have crashed any program trying to load them into memory.) Instead of reading the source table into memory,...

2. Data specific solution. To implement this algorithm, we need to include details specific to our particular records in our code. It is not a good first step toward re-utilization.

3. Data conversion. We need to fetch data from the database, eventually transform it into suitable formats for our calculation and then send it back, re-packed in database format. Not always an issue, but think about the handling of floating point fi...

4. Finally, I would say that this kind of task is not your job. Nor is Perl's. It belongs in the database engine, which can easily, within its boundaries, identify unique values and make a lookup table out of them. And it can also easily make a join ...

That said, the solution springs to mind. Let the database engine do its job, and have Perl drive it towards the solution we need to achieve.  The algorithm is based upon the fact that a table created from a SELECT DISTINCT statement is guaranteed to ...

The algorithm takes four steps:

I. create the lookup table

	CREATE TABLE lookup ({lookupID} INT NOT NULL auto_increment 
	          primary key, {LOOKUP FIELDS});

	#(and add a key for each {LOOKUP FIELDS})

II. fill in the lookup table
	
	INSERT INTO lookup 
	SELECT DISTINCT NULL {LOOKUP FIELDS} FROM source_table;
	#(the {lookupID} is automatically created, being auto_increment)

III. create the normalized table

	CREATE TABLE norm_table ({source_table FIELDS} - 
	         {LOOKUP FIELDS} + {lookupID}) 

IV. fill in the normalized table

	INSERT INTO normalized table 
	SELECT {source_table FIELDS} - {LOOKUP FIELDS} + {lookupID}
	FROM source_table 
	INNER JOIN lookup 
	    on (source_table.{LOOKUP FIELDS}= lookup.{LOOKUP FIELDS}) 

As you can see, the entire operation is run in the server workspace, thus avoiding problems of (a) fetching records (less network traffic), (b) handling data conversion, (c) memory occupation and (d) efficiency.

Let's see an example.

Having a table MP3 with these fields

 mysql> describe MP3;
 +----------+-------------+------+-----+----------+----------------+
 | Field    | Type        | Null | Key | Default  | Extra          |
 +----------+-------------+------+-----+----------+----------------+
 | ID       | int(11)     |      | PRI | NULL     | auto_increment |
 | title    | varchar(40) |      | MUL |          |                |
 | artist   | varchar(20) |      | MUL |          |                |
 | album    | varchar(30) |      | MUL |          |                |
 | duration | time        |      |     | 00:00:00 |                |
 | size     | int(11)     |      |     | 0        |                |
 | genre    | varchar(10) |      | MUL |          |                |
 +----------+-------------+------+-----+----------+----------------+
 7 rows in set (0.00 sec)

We want to produce two tables, the first one having only [ID, title, duration, size], while the second one should get [artist, album, genre]. (The second one will also needed to be further split into [artist] and [album, genre] but we can deal with t...

Here are the instructions to normalize this table:



( run in 0.948 second using v1.01-cache-2.11-cpan-39bf76dae61 )