App-combinesheets
view release on metacpan or search on metacpan
Book 2 from B2-e Katrin
authors.tsv:
Age Name
28 Kim
20 Katrin
30 Blanka
50 Lazy author
The output (again, depending on which input is considered a primary
input) will be (a list of included column is defined in the
configuration file - see later):
combinesheets -cfg books_to_authors.cfg -in BOOK=books.tsv AUTHOR=authors.tsv
Name Title Age Note
Blanka Book 1 30 from B1-d
Katrin Book 3 20 from B3-c
Katrin Book 2 20 from B2-e
Kim Book 1 28 from B1-a
Kim Book 2 28 from B2-b
combinesheets -cfg books_to_authors.cfg -in AUTHOR=authors.tsv BOOK=books.tsv
Name Title Age Note
Blanka Book 1 30 from B1-d
Katrin Book 3 20 from B3-c
Katrin Book 2 20 from B2-e
Kim Book 1 28 from B1-a
Kim Book 2 28 from B2-b
Lazy author 50
ADVANCED USAGE
Additionally to the merging columns from one or more spreadsheets, this
script can also add completely new columns to the resulting spreadsheet,
the columns that do not exist in any of the input spreadsheet. Such
columns are called "calculated columns".
Each "calculated column" is created either by an external, command-line
driven, program, or by a Perl subroutine. In both cases, the user must
create (write) such external program or such Perl subroutine. Therefore,
this usage is meant more for developers than for the end users.
Note that this advanced feature is meant only for new columns, not for
new rows. Therefore, it cannot be used, for example, to create rows with
totals of columns.
Calculated columns by external programs
If specified, an external program is invoked for each row. It can be
specified either by a keyword PROG or by a keyword PROGS - see syntax in
the *configuration* section. In both cases, the value of the standard
output of these programs become the value of the calculated column (a
trailing newline of this standard output is removed and other newlines
are replaced by spaces).
A program defined by the PROGS is called without any arguments ("S" in
*PROGS* stands for a *Simple*). That's why it does not have any
knowledge for which row it has been invoked. Its usage is, therefore,
for column values that are not dependent on other values from the
spreadsheet. For example, for the "cars.tsv" shown above, you can add a
column "Last updated" by calling a UNIX program "date" - again, see an
example the *configuration* section.
A program defined by the PROG is called with one argument which is a
filename. This file contains the current row; each of its lines has two,
TAB-separated, fields. The first field is the column name and the second
field is the column value. For example, when processing the last row of
the "cars.tsv" given above, the file will have the following content:
Model Skoda
Year 2002
Owned by Senger
The files are only temporary and will be removed when "combinesheets"
finishes.
Calculated columns by a Perl subroutine
If specified by the keyword PERL, a Perl subroutine is called for each
row with the three arguments:
1 A hashref with information about the current column. Not often used
but may be handy if the same subroutine deals with more columns and,
therefore, needs to know for which column it was invoked. See the
*flights* example in the *configuration* section.
2 An arrayref with all column names.
3 An arrayref with all column values - in the same order as the column
names.
Actually, depending how the subroutine is defined in the configuration,
it may get as the first argument the module/class name where it belongs
to. If you define it like this:
PERL Module::Example::test
the "test" subroutine is called, indeed, with the three arguments as
described above. However, if your definition is rather:
PERL Module::Example->test
then the "test" subroutine is considered a Perl method and its first
argument is the module/class name. It is up to you to decide how you
want/need to write your functions. Again, an example is available in the
*configuration* section.
The return value of the subroutine will become a new value in the
calculated column. Do not return undef but rather an empty string if the
value cannot be created.
What is an advantage of writing my own module/package if I can simply
write an external program (perhaps also in Perl) doing exactly the same?
The Perl module stays in the memory for the whole time of processing all
input rows and, therefore, you can re-use some calculations done for the
previous rows. An example about it ("flights") is given in the
*configuration* section.
ARGUMENTS and OPTIONS
-config <config-file>
A filename with a configuration file. This is a mandatory parameter.
The configuration file describes:
There should be one MATCH line for each input spreadsheet. The data
in the column defined by the "column-header" will be used to find
the corresponding lines. In our example, the data in the column
*Surname* in the "persons.tsv" will be matched with the data in the
column *Owned by* in the "cars.tsv" (the rows having the same values
in these two columns will be merged into one resulting row).
Advanced configuration
If you want to add so-called *calculated columns* as described in
the "ADVANCED USAGE" you need to use few additional reserved words
in the configuration file. These words are PROG, PROGS and/or PERL.
They are used in the place where the new calculated column should be
placed. Their lines have the program name or the Perl subroutine
name in the second column, and they have mandatory third column with
the resulting name of the calculated column.
For example, we wish to add two columns to the input spreadsheet
"cars.tsv". The input file (the same as in the introduction) is:
Model Year Owned by
Praga 1936 Someone else
Mini 1968 Gudernova
Skoda 2002 Senger
We wish to add a column *Car age* that shows the difference between
the actual year and the value from the *Year* column. We have a
shell script "age.sh" doing it:
#!/bin/bash
YEAR=`grep Year $1 | cut -f2`
NOW=`date +%Y`
echo $(($NOW-$YEAR))
The configuration file "cars.cfg" (assuming that we want the other
columns to remain the same) is:
MATCH CAR=Owned by
CAR Owned by
CAR Model
CAR Year
PROG age.sh Car age
When we run:
combinesheets -config cars.cfg -inputs CAR=cars.tsv
we get this result:
Owned by Model Year Car age
Gudernova Mini 1968 44
Senger Skoda 2002 10
Someone else Praga 1936 76
You can see that there is no need to use "combinesheets" for really
combining *more* sheets, an input can be just one sheet.
Another example adds a *fixed* column to the same input, a column
named *Last updated* that gets its value from a UNIX command "date".
This program does not get any information which row it has been
invoked for. The configuration file is now (note the new line with
the PROGS):
MATCH CAR=Owned by
CAR Owned by
CAR Model
CAR Year
PROG age.sh Car age
PROGS date Last updated
and the result is now:
Owned by Model Year Car age Last updated
Gudernova Mini 1968 44 Mon Feb 27 12:32:04 AST 2012
Senger Skoda 2002 10 Mon Feb 27 12:32:04 AST 2012
Someone else Praga 1936 76 Mon Feb 27 12:32:04 AST 2012
The last possibility is to call a Perl subroutine - using the
reserved word PERL in the configuration file. Let's have an input
spreadsheet ("flights.tsv") with data about flights:
Date Flight Airport From Airport To
2009-01-18 AY838 London LHR Helsinki Vantaa
2009-01-22 AY839 Helsinki Vantaa London LHR
2009-03-15 NW2 Manila Tokyo Narita
2009-03-21 NW1 Tokyo Narita Manila
2011-05-06 SV326 Sharm El Sheik Jeddah
2011-07-31 RJ700 Amman Jeddah
2011-09-21 ME369 Jeddah Beirut
2011-09-24 ME368 Beirut Jeddah
2011-12-02 EZY3064 Prague London Stansted
2011-12-09 EZY3067 London Stansted Prague
2012-01-26 MS663 Cairo Jeddah
We want to add columns with the international airport codes for both
*Airport From* and *Airport To*. The new columns will be named *Code
From* and *Code To*. The Perl subroutine will use a web service to
find the code. The subroutine will use a closure that will remember
already fetched codes so the web service does not need to be called
several times for the same airport name.
The configuration file "flights.cfg" is:
MATCH FLY=Date
FLY Date
FLY Flight
FLY Airport From
PERL Airport->find_code Code From
FLY Airport To
PERL Airport->find_code Code To
The name of the subroutine is attached to the module where it comes
from by either :: or -> notation.
The invocation is:
combinesheets -config flights.cfg -inputs FLY=flights.tsv
The full code for the module "Airport", the file "Airport.pm" is
here:
package Airport;
use warnings;
use strict;
use LWP::Simple;
use JSON;
# preparing a closure in order not to fetch the same airport code again and again
my $already_found = make_already_found();
sub make_already_found {
my $already_found = {};
( run in 1.219 second using v1.01-cache-2.11-cpan-0bb4e1dffa6 )