    51    my $m = $machine{$hn};
    52    ($proc[$_], $pid[$_]) = $m->open("./pi $_ $N $np |");
    53    $readset->add($proc[$_]);
    54    my $address = 0+$proc[$_];
    55    $id{$address} = $_;
    56  }

During the last stage the master node simply waits on the L<IO::Select>
object, listening on each of the channels. As soon as a result is received,
it is added to the running total C<$pi>:

    58  my @ready;
    59  my $count = 0;
    60  do {
    61    push @ready, $readset->can_read unless @ready;
    62    my $handle = shift @ready;
    63
    64    my $me = $id{0+$handle};
    65
    66    my ($partial);
    67    my $numBytesRead = sysread($handle,  $partial, 1024);
    68    chomp($partial);
    69
    70    $pi += $partial;
    71    print "Process $me: machine = $machine[$me % $nummachines] partial = $partial pi = $pi\n";
    72
    73    $readset->remove($handle) if eof($handle);
    74  } until (++$count == $np);
    75
    76  my $elapsed = tv_interval ($t0);
    77  print "Pi = $pi. N = $N Time = $elapsed\n";
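
The same multiplexing idea can be tried without any remote machine. The
following is only a minimal sketch, not the code of C<gridpipes.pl>: the
function name C<gather_partials> and the stand-in workers (local C<perl>
processes that each print one number) are assumptions made for illustration,
and handles are identified by C<fileno> rather than by their address.

```perl
use strict;
use warnings;
use IO::Select;

# Start one child process per command and collect whatever each one prints,
# serving the handles in the order in which they become readable.
# (gather_partials is a hypothetical helper, not part of GRID::Machine.)
sub gather_partials {
    my @commands = @_;    # each element: array ref holding a command and its arguments

    my $readset = IO::Select->new;
    my %id;               # file descriptor number => worker index

    for my $i (0 .. $#commands) {
        open(my $fh, '-|', @{ $commands[$i] })
            or die "Cannot start worker $i: $!";
        $readset->add($fh);
        $id{ fileno($fh) } = $i;
    }

    my @partial;
    while ($readset->count) {
        for my $fh ($readset->can_read) {
            my $bytes = sysread($fh, my $buf, 1024);
            die "sysread failed: $!" unless defined $bytes;
            if ($bytes == 0) {            # end of file: this worker has finished
                $readset->remove($fh);
                close $fh;
                next;
            }
            $partial[ $id{ fileno($fh) } ] .= $buf;
        }
    }

    chomp @partial;
    return @partial;
}

# Stand-in workers: local perl processes printing 0.25, 0.5 and 0.75.
my @workers = map { [ $^X, '-e', "print $_ / 4, qq{\\n}" ] } 1 .. 3;

my $total = 0;
$total += $_ for gather_partials(@workers);
print "total = $total\n";
```

Unlike the loop above, this sketch keeps reading each handle until end of
file instead of assuming that a complete result arrives in a single
C<sysread> call, which is a safer choice when a worker may write its output
in several pieces.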


=head1 PERFORMANCE: COMPUTATIONAL RESULTS

Let us first measure how long the I<pure C> program takes on each
of the nodes involved (C<nereida>, C<beowulf> and C<orion>). To get an idea of how
things behave for a sufficiently large computation, we set C<$N> to C<1 000 000 000> intervals:

    pp2@nereida:~/LGRID_Machine/examples$ time ssh nereida 'pi/pi 0 1000000000 1'
    3.141593

    real    0m32.534s
    user    0m0.036s
    sys     0m0.008s

    pp2@nereida:~/LGRID_Machine/examples$ time ssh beowulf 'pi/pi 0 1000000000 1'
    3.141593

    real    0m27.020s
    user    0m0.036s
    sys     0m0.008s

    casiano@beowulf:~$ time ssh orion 'pi/pi 0 1000000000 1'
    3.141593

    real    0m29.120s
    user    0m0.028s
    sys     0m0.003s

As you can see, there is some heterogeneity here. Machine C<nereida> (my desktop)
is slower than the other two. C<beowulf> is the fastest.

Now let us run the parallel Perl program on C<nereida> using only the C<beowulf>
node. The time spent is roughly comparable to the I<pure C> time. That is nice:
the overhead introduced by the coordination tasks is small (compare it
with the C<beowulf> entry above):

    pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 1 1000000000
    Process 0: machine = beowulf partial = 3.141593 pi = 3.141593
    Pi = 3.141593. N = 1000000000 Time = 27.058693

    real    0m28.917s
    user    0m0.584s
    sys     0m0.192s

Now comes the true test: will it be faster using two nodes? And by how much?

    pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 2 1000000000
    Process 0: machine = beowulf partial = 1.570796 pi = 1.570796
    Process 1: machine = orion partial = 1.570796 pi = 3.141592
    Pi = 3.141592. N = 1000000000 Time = 15.094719

    real    0m17.684s
    user    0m0.904s
    sys     0m0.260s
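
Both processes report the same partial value (1.570796), which is what one
would expect if the intervals are dealt out cyclically among the processes.
The sketch below illustrates that scheme in Perl. It is an assumption about
how the work is split, not the source of the C<pi> program: the function name
C<partial_pi> and the use of the midpoint rule for the integral of
C<4/(1+x*x)> on C<[0,1]> are chosen for illustration, and C<$N> is kept small
so that it runs quickly.

```perl
use strict;
use warnings;

# Partial midpoint-rule sum for the integral of 4/(1+x^2) on [0,1].
# Worker $id handles intervals $id, $id+$np, $id+2*$np, ... (cyclic distribution).
# (partial_pi is a hypothetical name used only for this sketch.)
sub partial_pi {
    my ($id, $N, $np) = @_;
    my $sum = 0;
    for (my $i = $id; $i < $N; $i += $np) {
        my $x = ($i + 0.5) / $N;      # midpoint of interval $i
        $sum += 4 / (1 + $x * $x);
    }
    return $sum / $N;
}

my ($N, $np) = (1_000_000, 2);
my $pi = 0;
for my $id (0 .. $np - 1) {
    my $partial = partial_pi($id, $N, $np);
    $pi += $partial;
    printf "Process %d: partial = %.6f pi = %.6f\n", $id, $partial, $pi;
}
```

With a cyclic distribution every process samples the whole interval
C<[0,1]>, so the partial sums are nearly equal. Splitting the interval into
contiguous blocks would give visibly different partial values.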

We can see that the sequential pure C version took 32 seconds on my desktop (C<nereida>).
By using two machines to which I have SSH access I have reduced that time to roughly 18 seconds.
That is C<32/18 = 1.8> times faster. The factor is even better if I
don't count the set-up time: C<32/15 = 2.1>. The total time decreases further
if I use the three machines:

    pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 3 1000000000
    Process 0: machine = beowulf partial = 1.047198 pi = 1.047198
    Process 1: machine = orion partial = 1.047198 pi = 2.094396
    Process 2: machine = nereida partial = 1.047198 pi = 3.141594
    Pi = 3.141594. N = 1000000000 Time = 10.971036

    real    0m13.700s
    user    0m0.952s
    sys     0m0.240s

which gives a speedup factor of C<32/13.7 = 2.3> or, not counting
the set-up time, C<32/10.9 = 2.9>.
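
The arithmetic can be reproduced with a few lines of Perl. This is only a
sketch: the function names C<speedup> and C<efficiency> are introduced here
for illustration, and efficiency (speedup divided by the number of nodes) is
a standard measure added to the figures quoted in the text. The times are the
C<real> values reported by C<time> in the runs above.

```perl
use strict;
use warnings;

# Speedup: sequential time divided by parallel time.
sub speedup {
    my ($t_seq, $t_par) = @_;
    return $t_seq / $t_par;
}

# Efficiency: speedup divided by the number of nodes used.
sub efficiency {
    my ($t_seq, $t_par, $nodes) = @_;
    return speedup($t_seq, $t_par) / $nodes;
}

my $t_seq = 32.534;                          # pure C program on nereida
my %t_par = (2 => 17.684, 3 => 13.700);      # gridpipes.pl with 2 and 3 nodes

for my $nodes (sort { $a <=> $b } keys %t_par) {
    printf "%d nodes: speedup = %.2f, efficiency = %.2f\n",
        $nodes,
        speedup($t_seq, $t_par{$nodes}),
        efficiency($t_seq, $t_par{$nodes}, $nodes);
}
```

Keep in mind that the baseline is the time on C<nereida>, the slowest of the
three machines, so these factors look better than they would against the
C<beowulf> time.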

What happens if you have a multiprocessor machine? The results depend
heavily on the underlying architecture. My machine C<nereida> is a dual Xeon:

  nereida:/tmp/graphviz-2.20.2# cat /proc/cpuinfo
  processor       : 0
  vendor_id       : GenuineIntel
  cpu family      : 15
  model           : 2
  model name      : Intel(R) Xeon(TM) CPU 2.66GHz
  stepping        : 5
  cpu MHz         : 2658.041
  cache size      : 512 KB
  physical id     : 0
  .......................................

  processor       : 1
  vendor_id       : GenuineIntel
  cpu family      : 15
  model           : 2
  model name      : Intel(R) Xeon(TM) CPU 2.66GHz
  stepping        : 5
  cpu MHz         : 2658.041
  cache size      : 512 KB
  physical id     : 0
  ...................................

After changing the C<Makefile> to include the C<-O3> option and changing the
line that defines the set of machines in C<gridpipes.pl>
(addresses of the form C<127.0.0.x> are mapped to localhost):

  my @machine = qw{127.0.0.1 127.0.0.2 127.0.0.3 127.0.0.4};

we have the following results:

  pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 1 1000000000
  Process 0: machine = 127.0.0.1 partial = 3.141593 pi = 3.141593
  Pi = 3.141593. N = 1000000000 Time = 32.968117

  real    0m33.858s
  user    0m0.336s
  sys     0m0.128s