App-combinesheets

 view release on metacpan or  search on metacpan

docs/App-combinesheets.html  view on Meta::CPAN

<p>The command-line arguments and options can be specified with single or double dash. Most of them can be abbreviated to the nearest non-biased length. They are case-sensitive.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="Duplicated_values_in_the_matching_columns"
>Duplicated values in the matching columns</a></h2>

<p>If there are repeated (the same) values in the column that serves as matching criterion then the resulting spreadsheet will have as many output lines (for a particular matching value) as is the number of all combinations of the lines with that mat...

<pre>   books.tsv:
   Title   Note    Author
   Book 1  from B1-a       Kim
   Book 2  from B2-b       Kim
   Book 3  from B3-c       Katrin
   Book 1  from B1-d       Blanka
   Book 2  from B2-e       Katrin

   authors.tsv:
   Age     Name
   28      Kim
   20      Katrin
   30      Blanka
   50      Lazy author</pre>

<p>The output (again, depending on which input is considered a primary input) will be (a list of included column is defined in the configuration file - see later):</p>

<pre>   combinesheets -cfg books_to_authors.cfg -in BOOK=books.tsv AUTHOR=authors.tsv

   Name    Title   Age Note
   Blanka  Book 1  30  from B1-d
   Katrin  Book 3  20  from B3-c
   Katrin  Book 2  20  from B2-e
   Kim     Book 1  28  from B1-a
   Kim     Book 2  28  from B2-b

   combinesheets -cfg books_to_authors.cfg -in AUTHOR=authors.tsv BOOK=books.tsv

   Name        Title   Age  Note
   Blanka      Book 1  30   from B1-d
   Katrin      Book 3  20   from B3-c
   Katrin      Book 2  20   from B2-e
   Kim         Book 1  28   from B1-a
   Kim         Book 2  28   from B2-b
   Lazy author         50</pre>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="ADVANCED_USAGE"
>ADVANCED USAGE</a></h1>

<p>Additionally to the merging columns from one or more spreadsheets, this script can also add completely new columns to the resulting spreadsheet, the columns that do not exist in any of the input spreadsheet. Such columns are called <code>calculate...

<p>Each <code>calculated column</code> is created either by an external, command-line driven, program, or by a Perl subroutine. In both cases, the user must create (write) such external program or such Perl subroutine. Therefore, this usage is meant ...

<p>Note that this advanced feature is meant only for new columns, not for new rows. Therefore, it cannot be used, for example, to create rows with totals of columns.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="Calculated_columns_by_external_programs"
>Calculated columns by external programs</a></h2>

<p>If specified, an external program is invoked for each row. It can be specified either by a keyword <b>PROG</b> or by a keyword <b>PROGS</b> - see syntax in the <i>configuration</i> section. In both cases, the value of the standard output of these ...

<p>A program defined by the <b>PROGS</b> is called without any arguments (<code>S</code> in <i>PROGS</i> stands for a <i>Simple</i>). That&#39;s why it does not have any knowledge for which row it has been invoked. Its usage is, therefore, for column...

<p>A program defined by the <b>PROG</b> is called with one argument which is a filename. This file contains the current row; each of its lines has two, TAB-separated, fields. The first field is the column name and the second field is the column value...

<pre>   Model       Skoda
   Year        2002
   Owned by    Senger</pre>

<p>The files are only temporary and will be removed when <code>combinesheets</code> finishes.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="Calculated_columns_by_a_Perl_subroutine"
>Calculated columns by a Perl subroutine</a></h2>

<p>If specified by the keyword <b>PERL</b>, a Perl subroutine is called for each row with the three arguments:</p>

<ol>
<li>A hashref with information about the current column. Not often used but may be handy if the same subroutine deals with more columns and, therefore, needs to know for which column it was invoked. See the <i>flights</i> example in the <i>configurat...

<li>An arrayref with all column names.</li>

<li>An arrayref with all column values - in the same order as the column names.</li>
</ol>

<p>Actually, depending how the subroutine is defined in the configuration, it may get as the first argument the module/class name where it belongs to. If you define it like this:</p>

<pre>   PERL   Module::Example::test</pre>

<p>the <code>test</code> subroutine is called, indeed, with the three arguments as described above. However, if your definition is rather:</p>

<pre>   PERL   Module::Example-&#62;test</pre>

<p>then the <code>test</code> subroutine is considered a Perl method and its first argument is the module/class name. It is up to you to decide how you want/need to write your functions. Again, an example is available in the <i>configuration</i> sect...

<p>The return value of the subroutine will become a new value in the calculated column. Do not return undef but rather an empty string if the value cannot be created.</p>

<p>What is an advantage of writing my own module/package if I can simply write an external program (perhaps also in Perl) doing exactly the same? The Perl module stays in the memory for the whole time of processing all input rows and, therefore, you ...

<h1><a class='u' href='#___top' title='click to go to top of document'
name="ARGUMENTS_and_OPTIONS"
>ARGUMENTS and OPTIONS</a></h1>

<dl>
<dt><a name="-config_&#60;config-file&#62;"
><b>-config &#60;config-file</b>&#62;</a></dt>

<dd>
<p>A filename with a configuration file. This is a mandatory parameter. The configuration file describes:</p>

<ul>
<li>which columns in individual input spreadsheets should be included in the resulting spreadsheet,</li>

<li>what names should be given to the resulting columns</li>

<li>in which order should be the columns in the resulting spreadsheet</li>

<li>which columns should be used to match individual lines,</li>
</ul>

<p>The configuration file is a TAB-separated file (with no header line). Empty lines and lines starting with a &#34;#&#34; character are ignored. Each line has two columns, in some cases there is an optional third column. Here is a configuration file...

docs/App-combinesheets.html  view on Meta::CPAN

   CAR     Model
   PERSON  Sex
   CHILD   Name
   CHILD   Born
   PERSON  Nickname
   PERSON  Age
   CAR     Year
   CAR     Owned by</pre>

<p>The first column is either a reserved word <code>MATCH</code>, or an identifier of an input spreadsheet. There are also few other reserved words - see more about them a bit later.</p>

<p>The identifier can be almost anything (and it does not appear in the input spreadsheet itself). It is also used in the command-line argument <code>-inputs</code> where it corresponds to a real file name of the input. The lines with identifiers def...

<p>The reserved word <code>MATCH</code> is used to define how to match lines in the input spreadsheets. The format of its second column is:</p>

<pre>   &#60;input-ID&#62;=&#60;column-header&#62;</pre>

<p>There should be one MATCH line for each input spreadsheet. The data in the column defined by the &#34;column-header&#34; will be used to find the corresponding lines. In our example, the data in the column <i>Surname</i> in the <code>persons.tsv</...

<p><b>Advanced configuration</b></p>

<p>If you want to add so-called <i>calculated columns</i> as described in the <a href="#ADVANCED_USAGE" class="podlinkpod"
>&#34;ADVANCED USAGE&#34;</a> you need to use few additional reserved words in the configuration file. These words are <b>PROG</b>, <b>PROGS</b> and/or <b>PERL</b>. They are used in the place where the new calculated column should be placed. Their li...

<p>For example, we wish to add two columns to the input spreadsheet <code>cars.tsv</code>. The input file (the same as in the introduction) is:</p>

<pre>   Model  Year  Owned by
   Praga  1936  Someone else
   Mini   1968  Gudernova
   Skoda  2002  Senger</pre>

<p>We wish to add a column <i>Car age</i> that shows the difference between the actual year and the value from the <i>Year</i> column. We have a shell script <code>age.sh</code> doing it:</p>

<pre>   #!/bin/bash
   YEAR=`grep Year $1 | cut -f2`
   NOW=`date +%Y`
   echo $(($NOW-$YEAR))</pre>

<p>The configuration file <code>cars.cfg</code> (assuming that we want the other columns to remain the same) is:</p>

<pre>   MATCH   CAR=Owned by

   CAR     Owned by
   CAR     Model
   CAR     Year
   PROG    age.sh  Car age</pre>

<p>When we run:</p>

<pre>   combinesheets -config cars.cfg -inputs CAR=cars.tsv</pre>

<p>we get this result:</p>

<pre>   Owned by        Model   Year    Car age
   Gudernova       Mini    1968    44
   Senger          Skoda   2002    10
   Someone else    Praga   1936    76</pre>

<p>You can see that there is no need to use <code>combinesheets</code> for really combining <i>more</i> sheets, an input can be just one sheet.</p>

<p>Another example adds a <i>fixed</i> column to the same input, a column named <i>Last updated</i> that gets its value from a UNIX command <code>date</code>. This program does not get any information which row it has been invoked for. The configurat...

<pre>   MATCH   CAR=Owned by

   CAR     Owned by
   CAR     Model
   CAR     Year
   PROG    age.sh  Car age
   PROGS   date    Last updated</pre>

<p>and the result is now:</p>

<pre>   Owned by        Model   Year    Car age   Last updated
   Gudernova       Mini    1968    44        Mon Feb 27 12:32:04 AST 2012
   Senger          Skoda   2002    10        Mon Feb 27 12:32:04 AST 2012
   Someone else    Praga   1936    76        Mon Feb 27 12:32:04 AST 2012</pre>

<p>The last possibility is to call a Perl subroutine - using the reserved word <b>PERL</b> in the configuration file. Let&#39;s have an input spreadsheet (<code>flights.tsv</code>) with data about flights:</p>

<pre>   Date         Flight    Airport From      Airport To
   2009-01-18   AY838     London LHR        Helsinki Vantaa
   2009-01-22   AY839     Helsinki Vantaa   London LHR
   2009-03-15   NW2       Manila            Tokyo Narita
   2009-03-21   NW1       Tokyo Narita      Manila
   2011-05-06   SV326     Sharm El Sheik    Jeddah
   2011-07-31   RJ700     Amman             Jeddah
   2011-09-21   ME369     Jeddah            Beirut
   2011-09-24   ME368     Beirut            Jeddah
   2011-12-02   EZY3064   Prague            London Stansted
   2011-12-09   EZY3067   London Stansted   Prague
   2012-01-26   MS663     Cairo             Jeddah</pre>

<p>We want to add columns with the international airport codes for both <i>Airport From</i> and <i>Airport To</i>. The new columns will be named <i>Code From</i> and <i>Code To</i>. The Perl subroutine will use a web service to find the code. The sub...

<p>The configuration file <code>flights.cfg</code> is:</p>

<pre>   MATCH   FLY=Date

   FLY     Date
   FLY     Flight
   FLY     Airport From
   PERL    Airport-&#62;find_code      Code From
   FLY     Airport To
   PERL    Airport-&#62;find_code      Code To</pre>

<p>The name of the subroutine is attached to the module where it comes from by either <b>::</b> or <b>-&#62;</b> notation.</p>

<p>The invocation is:</p>

<pre>   combinesheets -config flights.cfg -inputs FLY=flights.tsv</pre>

<p>The full code for the module <code>Airport</code>, the file <code>Airport.pm</code> is here:</p>

<pre>   package Airport;
   use warnings;
   use strict;

   use LWP::Simple;
   use JSON;

   # preparing a closure in order not to fetch the same airport code again and again
   my $already_found = make_already_found();
   sub make_already_found {
      my $already_found = {};
      return sub {
         my ($airport_name, $airport_code) = @_;
         if (exists $already_found-&#62;{$airport_name}) {
            if ($airport_code) {
                $already_found-&#62;{$airport_name} = $airport_code;
            }
            return $already_found-&#62;{$airport_name};
         } else {
            $already_found-&#62;{$airport_name} = ($airport_code ? $airport_code : 1);



( run in 0.661 second using v1.01-cache-2.11-cpan-0bb4e1dffa6 )