looks results from the CPAN

B-C
Every `$i` and `$j` got expanded into a literal, 0 .. 4.

I did this loop unrolling for the three functions, and the results
were impressive. It is a nice little macro trick which you could use
for normal uncompiled perl code also.  With compiled code the
loop-unrolling should happen automatically.

Full code here: [nbody.perl-2.perl](https://github.com/rurban/shootout/commit/62b216756320e8c224eef2c933326924ab73c18a)

Original:

    $ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl 50000
    script/perlcc: c time: 0.380353
    script/perlcc: cc time: 0.967525
    -0.169075164
    -0.169078071
    script/perlcc: r time: 2.214327

Unrolled:

    $ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl-2.perl 50000
    script/perlcc: c time: 0.448817
    script/perlcc: cc time: 2.167499
    -0.169075164
    -0.169078071
    script/perlcc: r time: 1.341283

Another **2x times faster!**

For comparison the same effect uncompiled:

    $ time perl ../shootout/bench/nbody/nbody.perl 50000
    -0.169075164
    -0.169078071

    real	0m3.650s
    user	0m3.644s
    sys	0m0.000s

Unrolled:

    $ time perl ../shootout/bench/nbody/nbody.perl-2.perl 50000
    -0.169075164
    -0.169078071

    real	0m2.399s
    user	0m2.388s
    sys	0m0.004s

So we went from **3.6s** down to **2.4s** and compiled to **1.3s**.

With N=50,000,000 we got **14m12.653s** uncompiled and **7m11.3597s**
compiled. Close to jruby, even if the array accesses still goes
through the `av_fetch` function, magic is checked and undefined indices
are autovivified.


Generalization
--------------

The above macro-code code looks pretty unreadable, similar to lisp
macros, with its mix of quoted and unquoted variables.  The compiler
needs to detect unrollable loop code which will lead to more
constants and AELEMFAST ops. And we better define a helper function
for easier generation of such unrolled loops.

    # unquote local vars
    sub qv {
      my ($s, $env) = @_;
      # expand our local loop vars
      $s =~ s/(\$\w+?)\b/exists($env->{$1})?$env->{$1}:$1/sge;
      $s
    }

    $energy = '
    sub energy
    {
      my $e = 0.0;
      my ($dx, $dy, $dz, $distance);';
      for my $i (0 .. $last) {
        my $env = {'$i'=>$i,'$last'=>$last};
        $energy .= qv('
        # outer-loop $i..4
        $e += 0.5 * $mass[$i] *
              ($vxs[$i] * $vxs[$i] + $vys[$i] * $vys[$i] + $vzs[$i] * $vzs[$i]);', $env);
        for (my $j = $i + 1; $j < $last + 1; $j++) {
          $env->{'$j'} = $j;
          $energy .= qv('
          # inner-loop $j..4
          $dx = $xs[$i] - $xs[$j];
          $dy = $ys[$i] - $ys[$j];
          $dz = $zs[$i] - $zs[$j];
          $distance = sqrt($dx * $dx + $dy * $dy + $dz * $dz);
          $e -= ($mass[$i] * $mass[$j]) / $distance;', $env);
        }
      }
      $energy .= '
      return $e;
    }';
    eval $energy; die if $@;

This looks now much better and leads in a BEGIN block to only neglectible
run-time penalty.
Full code here: [nbody.perl-2a.perl](https://github.com/rurban/shootout/commit/c35bb85ed84941157eb01b7ca844d3b4472e0df3)

I also tried a generic `unroll_loop()` function, but it was a bit too
unstable finding the end of the loop blocks on the source level, and
`qv()` looked good enough. The compiler can use the optree to find the
optimization.


Types and autovivification
--------------------------

A naive optimization would check the index ranges beforehand, and access
the array values directly. Something the type optimizer for arrays would
do.

    my (num @xs[4],  num @ys[4],  num @zs[4]);
    my (num @vxs[4], num @vys[4], num @vzs[4]);
    my num @mass[4];

And instead of `$xs[0] * $xs[1]` which compiles to
AELEMFASTs, currently inlined by B::CC to:

    { AV* av = MUTABLE_AV(PL_curpad[6]);
      SV** const svp = av_fetch(av, 0, 0);
      SV *sv = (svp ? *svp : &PL_sv_undef);
      if (SvRMAGICAL(av) && SvGMAGICAL(sv)) mg_get(sv);
      PUSHs(sv);
    }
    { AV* av = MUTABLE_AV(PL_curpad[6]);
      SV** const svp = av_fetch(av, 1, 0);
      SV *sv = (svp ? *svp : &PL_sv_undef);
      if (SvRMAGICAL(av) && SvGMAGICAL(sv)) mg_get(sv);
      PUSHs(sv);
    }
    rnv0 = POPn; lnv0 = POPn;       /* multiply */
    d30_tmp = lnv0 * rnv0;

It should compile to:

    d30_tmp = (double)AvARRAY(PL_curpad[6])[0] *
              (double)AvARRAY(PL_curpad[6])[1];

With the size declaration you can omit the `av_fetch()` call and undef
check ("autovivification"), with the type `num` you do not need to get
to the `SvNV` of the array element, the value is stored directly, and
the type also guarantees that there is no magic to be checked.  So
`AvARRAY(PL_curpad[6])[0]` would return a double.

And the stack handling (PUSH, PUSH, POP, POP) can also be optimized
away, since the ops are inlined already.  That would get us close to
an optimizing compiler as with Haskell, Lua, PyPy or LISP. Not close
to Go or Java, as their languages are stricter.

I tried a simple B::CC AELEMFAST optimization together with "no autovivification"
which does not yet eliminate superfluous PUSH/POP pairs but could be applied
for typed arrays and leads to another 2x times win.

2.80s down to 1.67s on a slower PC with N=50,000.
( run in 0.727 second using v1.01-cache-2.11-cpan-6aa56a78535 )