DBSchema-Normalizer

Normalizer.pm

The concept behind DBSchema::Normalizer is based upon some DBMS properties. To replace repeating fields with a foreign key pointing to a lookup table, you must be sure that for each distinct set of values you have a distinct foreign key. The first idea that comes to mind is to do all the work in the client script (a rough Perl sketch of this approach follows the list):

      I. Read all records into memory
     II. For each record, identify the unique value for the fields to be
         moved into a lookup table and store it in a hash
    III. Again, for each record, find the corresponding value in the
         previously saved hash and save the non-lookup fields plus the
         unique key into a new table
     IV. For each key in the hash, write the values to a lookup table
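
For concreteness, here is a minimal sketch of that client-side approach, written with DBI. The connection parameters, the source table (employees), the lookup fields (dept_name, dept_address) and the target tables (employees_norm, departments) are illustrative assumptions only, and the target tables are supposed to exist already. This is the attempt that the objections below refer to.

    #!/usr/bin/perl
    # Naive client-side normalization sketch (hypothetical names throughout).
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:test_db', 'user', 'password',
                           { RaiseError => 1 });

    # I/II. read every record and collect the distinct combinations of the
    #       lookup fields in a hash, assigning each one a numeric key
    my (%lookup_id, $next_id);
    my $sth = $dbh->prepare('SELECT dept_name, dept_address FROM employees');
    $sth->execute;
    while (my ($name, $address) = $sth->fetchrow_array) {
        my $key = join "\034", $name, $address;   # composite hash key
        $lookup_id{$key} = ++$next_id unless exists $lookup_id{$key};
    }

    # III. re-read the records and copy them into the new table, replacing
    #      the repeated fields with the numeric key stored in the hash
    my $read  = $dbh->prepare(
        'SELECT emp_id, emp_name, dept_name, dept_address FROM employees');
    my $write = $dbh->prepare(
        'INSERT INTO employees_norm (emp_id, emp_name, dept_id) VALUES (?, ?, ?)');
    $read->execute;
    while (my ($id, $emp, $name, $address) = $read->fetchrow_array) {
        $write->execute($id, $emp, $lookup_id{ join "\034", $name, $address });
    }

    # IV. finally, dump the hash into the lookup table
    my $ins = $dbh->prepare(
        'INSERT INTO departments (dept_id, dept_name, dept_address) VALUES (?, ?, ?)');
    while (my ($key, $id) = each %lookup_id) {
        $ins->execute($id, split /\034/, $key);
    }
    $dbh->disconnect;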

I can find four objections against such an attempt:

1. Memory problems. The source table can be very large (and some of the tables I had to normalize were indeed huge; this kind of solution would have crashed any program trying to load them into memory). Instead of reading the source table into memory, we could read the records twice from the database and deal with them one at a time. However, even the size of the hash alone could prove to be too much for our computer's memory: a hash of 2_000_000 items is unlikely to fit comfortably in memory on most of today's desktops.

2. Data-specific solution. To implement this algorithm, we need to embed details specific to our particular records in our code. It is not a good first step toward reuse.

3. Data conversion. We need to fetch data from the database, possibly transform it into suitable formats for our calculation, and then send it back, re-packed in database format. Not always an issue, but think about the handling of floating point fields and timestamp fields with reduced view. Everything can be solved, but it could be a heavy overhead.

4. Finally, I would say that this kind of task is not your job. Nor is Perl's. It belongs in the database engine, which can easily, within its boundaries, identify unique values and make a lookup table out of them. And it can also easily make a join ...

That said, the solution springs to mind. Let the database engine do its job, and have Perl drive it towards the solution we need to achieve. The algorithm is based upon the fact that a table created from a SELECT DISTINCT statement is guaranteed to contain only unique records.

The algorithm takes four steps:

I. create the lookup table

        CREATE TABLE lookup ({lookupID} INT NOT NULL auto_increment
                  primary key, {LOOKUP FIELDS});
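
Steps II to IV are cut off in this extract. As a hedged illustration only (not necessarily the exact SQL that DBSchema::Normalizer generates), the engine-side approach can be completed by filling the lookup table with a SELECT DISTINCT and then joining it back to the source table. The sketch below uses the same illustrative names as the client-side example above.

    #!/usr/bin/perl
    # Engine-side normalization sketch: the DBMS does the heavy lifting.
    # Table and column names are illustrative assumptions only.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:test_db', 'user', 'password',
                           { RaiseError => 1 });

    # I. create the lookup table (concrete version of the statement above)
    $dbh->do(q{CREATE TABLE departments
                 (dept_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
                  dept_name VARCHAR(50), dept_address VARCHAR(100))});

    # II. let the engine pick out the distinct combinations of lookup fields
    $dbh->do(q{INSERT INTO departments (dept_name, dept_address)
               SELECT DISTINCT dept_name, dept_address FROM employees});

    # III. create the normalized destination table
    $dbh->do(q{CREATE TABLE employees_norm
                 (emp_id INT NOT NULL PRIMARY KEY,
                  emp_name VARCHAR(50), dept_id INT NOT NULL)});

    # IV. join source and lookup on the lookup fields, so that every row
    #     receives the foreign key of its own set of values
    $dbh->do(q{INSERT INTO employees_norm (emp_id, emp_name, dept_id)
               SELECT e.emp_id, e.emp_name, d.dept_id
               FROM employees e
               INNER JOIN departments d USING (dept_name, dept_address)});

    $dbh->disconnect;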
