DBSchema-Normalizer

 view release on metacpan or  search on metacpan

docs/Normalizer.txt  view on Meta::CPAN

      #
      use DBSchema::Normalizer qw(create_lookup_table create_normalized_table);
      my $norm = DBSchema::Normalizer->new( 
            { 
                DSN      => $DSN,
                username => $username,
                password => $password
            });
      $norm->create_lookup_table (
              {
                 src_table     => $tablename,
                 index_field   => $indexfield,
                 lookup_fields => $lookupfields,
                 lookup_table  => $lookuptable
          });
      $norm->create_normalized_table (
              {
                 src_table     => $tablename,
                 index_field   => $indexfield,
                 lookup_fields => $lookupfields,
                 lookup_table  => $lookuptable,
                 dest_table    => $dest_table,
                 copy_indexes  => "yes",
          });

DESCRIPTION
    DBSchema::Normalizer is a module to help transforming MySQL database
    tables from 1st to 2nd normal form. Simply put, it will create a lookup
    table out of a set of repeating fields from a source table, and replace
    such fields by a foreign key that points to the corresponding fields in
    the newly created table. All information is taken from the database
    itself. There is no need to specify existing details. The module is
    capable of re-creating existing indexes, and should deal with complex
    cases where the replaced fields are part of a primary key.

  Algorithm

    The concept behind DBSchema::Normalizer is based upon some DBMS
    properties. To replace repeating fields with a foreign key pointing to a
    lookup table, you must be sure that for each distinct set of values you
    have a distinct foreign key. You might be tempted to solve the problem
    with something like this:

             I. Read all records into memory
            II. for each record, identify the unique value for the fields to be
                moved into a lookup table and store it in a hash
            II. again, for each record, find the corresponding value in the 
                previously saved hash and save the non-lookup fields plus the 
                    unique key into a new table
            IV. for each key in the hash, write the values to a lookup table

    I can find four objections against such attempt:

    1. Memory problems. The source table can be very large (and some of the
    table I had to normalize were indeed huge. This kind of solution would
    have crashed any program trying to load them into memory.) Instead of
    reading the source table into memory, we could just read the records
    twice from the database and deal with them one at the time. However,
    even the size of the hash could prove to be too much for our computer
    memory. A hash of 2_000_000 items is unlikely to handle memory
    efficiently in most nowadays desktops.

    2. Data specific solution. To implement this algorithm, we need to
    include details specific to our particular records in our code. It is
    not a good first step toward re-utilization.

    3. Data conversion. We need to fetch data from the database, eventually
    transform it into suitable formats for our calculation and then send it
    back, re-packed in database format. Not always an issue, but think about
    the handling of floating point fields and timestamp fields with reduced
    view. Everything can be solved, but it could be a heavy overhead for
    your sub.

    4. Finally, I would say that this kind of task is not your job. Nor is
    Perl's. It belongs in the database engine, which can easily, within its
    boundaries, identify unique values and make a lookup table out of them.
    And it can also easily make a join between source and lookup table.

    That said, the solution springs to mind. Let the database engine do its
    job, and have Perl drive it towards the solution we need to achieve. The
    algorithm is based upon the fact that a table created from a SELECT
    DISTINCT statement is guaranteed to have a direct relationship with each
    record of the source table, when compared using the same elements we
    considered in the SELECT DISTINCT.

    The algorithm takes four steps:

    I. create the lookup table

            CREATE TABLE lookup ({lookupID} INT NOT NULL auto_increment 
                      primary key, {LOOKUP FIELDS});

            #(and add a key for each {LOOKUP FIELDS})

    II. fill in the lookup table

            INSERT INTO lookup 
            SELECT DISTINCT NULL {LOOKUP FIELDS} FROM source_table;
            #(the {lookupID} is automatically created, being auto_increment)

    III. create the normalized table

            CREATE TABLE norm_table ({source_table FIELDS} - 
                     {LOOKUP FIELDS} + {lookupID}) 

    IV. fill in the normalized table

            INSERT INTO normalized table 
            SELECT {source_table FIELDS} - {LOOKUP FIELDS} + {lookupID}
            FROM source_table 
            INNER JOIN lookup 
                on (source_table.{LOOKUP FIELDS}= lookup.{LOOKUP FIELDS}) 

    As you can see, the entire operation is run in the server workspace,
    thus avoiding problems of (a) fetching records (less network traffic),
    (b) handling data conversion, (c) memory occupation and (d) efficiency.

    Let's see an example.

    Having a table MP3 with these fields



( run in 0.964 second using v1.01-cache-2.11-cpan-39bf76dae61 )