Acme-OneHundredNotOut
OneHundredNotOut.pm view on Meta::CPAN
writing about Perl will forgive me a certain spot of self-indulgence as
I look back over my CPAN career, especially since I feel that the
diversity of modules that I've produced is a good indication of the
diversity of what can be done with Perl.
Let's begin, then, with some humble beginnings, and then catch up on
recent history.
=head2 The Embarrassing Past
Contrary to popular belief, I was not always a CPAN author. I started
writing modules in 1998, immediately after reading the first edition of
the Perl Cookbook - yes, you can blame Nat and Tom for all this. The
first module that I released was L<Tie::DiscoveryHash>, since I'd just
learnt about tied hashes. As with many of my modules, it was an integral
part of another software project which I actually never finished, and
now can't find.
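The idea behind tied hashes can be sketched in a few lines: a hash whose values are "discovered" on first access. This is only an illustrative sketch of the C<tie> mechanism, not L<Tie::DiscoveryHash>'s actual interface; the C<Lazy::Hash> package name is invented.

```perl
use strict;
use warnings;

package Lazy::Hash;
use Tie::Hash;                      # provides the Tie::StdHash base class
our @ISA = ('Tie::StdHash');

# FETCH is called on every read; compute and cache on first access.
sub FETCH {
    my ($self, $key) = @_;
    $self->{$key} = "discovered:$key" unless exists $self->{$key};
    return $self->{$key};
}

package main;

tie my %h, 'Lazy::Hash';
print $h{answer}, "\n";             # prints "discovered:answer"
```

All the other hash operations (store, delete, iteration) fall through to C<Tie::StdHash>, so only the interesting behaviour needs writing.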
The first module that I ever B<wrote> (but, by a curious quirk of fate,
precisely the fiftieth module I released) was called L<String::Tokeniser>,
which is still a reasonably handy way of getting an iterator over
which makes me laugh.) This too was for an abortive project, C<webperl>,
an application of Don Knuth's WEB system of structured documentation to
Perl. However, given the code quality of these two modules, it's perhaps
just as well that the projects never saw the light of day.
There are a few other modules I'd rather like to forget, too.
C<Devel::Pointer> was a sick joke that went badly wrong - it allowed
people to use pointers in Perl. Some people failed to notice that
referring to memory locations directly in an extremely high-level
language was a dangerous and silly thing to do, and actually used the
damned thing, and I started getting requests for support for it. Then at
some point in 2001, when I should really have known better, I developed
an interest in Microsoft's .NET and the C# language, which I still think
is pretty neat; but I decided it might be a good idea to translate the
Mono project's tokenizer and parser into Perl, ending up with
L<C::Sharp>. I never got around to doing the parser part, or indeed
anything else with it, and so it died a lonely death in a dark corner of
CPAN. L<GTK::HandyClist> was my foray into programming graphical
applications, which started and ended there. L<Bundle::SDK::SIMON> was
actually the slides from a talk on my top ten favourite CPAN modules -
except that this changes so quickly over time, it doesn't really make
much sense any more.
Finally, L<Array::FileReader> was an attempt to optimize a file access
process. Unfortunately, my "optimization" ended up introducing more
overhead than the naive solution. It all goes to show. Since then,
Mark-Jason Dominus, another huge influence in the development of my CPAN
career, has written C<Tie::File>, which not only has a better name but
is actually efficient too.
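C<Tie::File> maps an array over the lines of a file, fetching records lazily rather than slurping the whole file into memory. A minimal sketch of its use (the temporary file and its contents are invented for illustration):

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Set up a small file to play with.
my ($fh, $filename) = tempfile();
print {$fh} "first\nsecond\nthird\n";
close $fh;

# Each array element is one line of the file (record separator stripped);
# reads and writes go straight to disk, one record at a time.
tie my @lines, 'Tie::File', $filename or die "Cannot tie: $!";
print scalar(@lines), "\n";   # prints "3"
print $lines[1], "\n";        # prints "second"
$lines[1] = 'SECOND';         # rewrites just that record on disk
untie @lines;
```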
think about writing a config file parser again. It's done once, forever.
L<Getopt::Auto> applies the same design principles to handling command
line arguments - I hate forgetting how to use C<Getopt::Long>.
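For comparison, the C<Getopt::Long> incantation one keeps forgetting looks something like this (a minimal sketch; the option names and sample arguments are invented):

```perl
use strict;
use warnings;
use Getopt::Long;

# Getopt::Long handles "--name=Simon", "--name Simon", negation
# ("--noverbose") and unambiguous abbreviations automatically.
local @ARGV = ('--name', 'Simon', '--verbose');

GetOptions(
    'name=s'   => \my $name,      # string-valued option
    'verbose!' => \my $verbose,   # boolean, negatable
) or die "Bad command line\n";

print "$name: ", ($verbose ? 'verbose' : 'quiet'), "\n";   # prints "Simon: verbose"
```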
Other attempts at making things simple for the end-user weren't that
successful. As part of writing my (first) mail archiving and indexing
program, C<Mail::Miner>, of which more later, I wanted a nice way for
users to specify a time period in which they're looking for mails - "a
week ago", "sometime last summer", "near the beginning of last month" -
and so on. L<Date::PeriodParser> would take these descriptions and turn
them into a start and end time in which to search. Except, of course,
that this is a very hard thing to do and requires a lot of heuristics,
and while I started off quite well, as ever, I got distracted with other
interesting and considerably more tractable problems.
=head2 Mail Handling
A good number of my Perl modules focussed on mail handling, so many that
I was actually able to get a job basically doing mail processing in
Perl. It all started with L<Mail::Audit>. I was introduced to
F<procmail> at University, and it was useful enough, but it kept having
locking problems and losing my mail, and I didn't really understand it,
to be honest, so I wanted to write my mail filtering rules in Perl.
C<Mail::Audit> worked well for a couple of years before it grew into an
obese monster. I actually only use a very old version of C<Mail::Audit>
on my production server.
As part of the attempt to slim it back down again, I abstracted out one
of the major parts of its functionality, delivering an email to a local
mailbox. Now I only use mbox files, so it was reasonably easy for me,
=head2 Linguistics
I'm actually a linguist by training, not a computer programmer,
graduating from the school of Oriental Studies with second and third
year options in Japanese linguistics. I'd like to think that my work at
Kasei was as much about linguistic and textual analysis as it was about
mail munging. With that in mind, I wrote a few language-related modules
during my time with them.
The first important module, which I started work on while I was playing
with C<Mail::Miner>, was L<Lingua::EN::Keywords>. This started life as a
relatively naive algorithm for picking common words out of a text in an
attempt to provide some keywords to describe what the text is "about", and
has matured into quite a handy little automatic topic recognition
module. Its natural counterpart is L<Lingua::EN::NamedEntity>, which
B<is> still a naive algorithm but sometimes those are the best ones.
This module has a bit of a story behind it. While analysing mails we were
trying to find people, places, times, and other things we could link
together into a knowledge base. The technical term for this is named
entity extraction. I found a useful library to do this, called C<GATE>.
It's written in Java, which meant using C<Inline::Java>, and is
extremely slow and complex. At the same time, I was writing a chapter on
computational linguistics with Perl in Advanced Perl Programming, and
wanted to talk about named entity extraction. Unfortunately, I only had
one module which did this, L<GATE::ANNIE::Simple>, and it was a hack. If
you're going to talk about a subject, it makes sense to compare and
contrast different solutions, and Tony had already been saying "why
don't you just write something to pull out capitalized phrases, for
starters?" I did this, intending to use it as a baseline, but of course
it's much faster than C<GATE> and not noticeably less accurate. Ho hum.
Another thing those wacky computational linguists do a lot of is
working with n-gram streams. In every discipline, there's a particular
hammer you can use to solve any given problem. In data mining, it's
called market basket analysis. In computational linguistics, it's
maximum entropy. You look at the past stream of n characters (that's an
n-gram) and work out how hard it is to see what's coming next.
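The counting half of this can be sketched in a few lines: tally which character follows each 4-gram in a training text (the corpus and variable names here are invented for illustration):

```perl
use strict;
use warnings;

my $text = "xylophone and xylophonist";
my %next;    # $next{$gram}{$char} = how often $char followed $gram

for my $i (0 .. length($text) - 5) {
    my $gram   = substr($text, $i, 4);
    my $follow = substr($text, $i + 4, 1);
    $next{$gram}{$follow}++;
}

# In this tiny corpus, "xylo" has only ever been followed by "p":
print join(',', sort keys %{ $next{'xylo'} }), "\n";   # prints "p"
```

Dividing each count by the total for its gram turns the tallies into the conditional probabilities a real model would use.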
For instance, if I feed you the 4-gram C<xylo> the chances of a C<p>
needed a way of representing the state of a poker game, so I wrote
L<Games::Poker::TexasHold'em> to do that. And also because it was a
fantastic abuse of the C<'> package separator.
Oh, and another of my early modules that refused to die was
L<Oxford::Calendar>, which converts between the academic calendar and
the rest of the world's. It all counts, you know.
=head2 The Future
I've had mixed feelings on Perl 6, starting with my very public
nightmare at its announcement in 1999 (hey, I'd just written a book on
Perl 5 internals, and now they were telling me it was obsolete), and then my
very public repentance in 2000, at which point I was very excited about
the whole thing. So much so that I produced vast numbers of design
documents for the language, most of which are now ignored, but that's OK,
and set to work helping Dan design the core of the interpreter too. In
fact, I somehow managed to do so much work on it that, after a hacking
session together at O'Reilly in Boston in 2001, Dan let me be the
release pumpking of L<parrot>, a job I did until life got busy in 2002.
I'm extremely happy to have been involved in that, and hope I didn't
start the project off on too much of a bad footing. It looks to be doing
fine now, at least.
I was still interested in how they were going to make the Perl 6 parser
work (I still am, but don't have enough time to throw at the problem),
and with my linguistic background I've always been interested in writing
parsers in general. So early on I started trying to write a
L<Perl6::Tokener>, which is now unfortunately quite obsolete, with the
intention of writing a parser later on. For most of 2002, my whiteboard
at home was covered with sketches of the Perl 6 grammar.
Then I found out that the parser is actually going to be dynamic - you
can reconfigure the grammar at runtime. Hey, I thought, that's going to
be fun. At this point, you can't use an ordinary state-table parser like
C<yacc>, as Perl has done so far, because that pre-computes the
transitions up front. Instead, you have to use a proper state machine
without pre-computed tables. But I couldn't find any parsers which