Acme-OneHundredNotOut
OneHundredNotOut.pm view on Meta::CPAN
L<Sub::Versive>, for instance, was the first module on CPAN to handle
pre- and post-hooks for a subroutine; it has since been joined by a
plethora of imitators. It was written, though, in response to a peculiar
scenario. I was writing a module (C<Safety::First>) which provided
additional built-in-like functions for Perl to encourage and facilitate
defensive programming and intelligible error reporting. ("Couldn't open
file? Why not?") These built-ins had to be available from every
package, which meant playing with C<UNIVERSAL::AUTOLOAD>. But what if
another package was already using C<UNIVERSAL::AUTOLOAD>? Hence,
C<Sub::Versive> wrapped it in a pre-hook. Of course, with the
interesting bit of the problem solved, C<Safety::First> was abandoned.
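As a rough illustration of what a pre-hook amounts to, here is a minimal pure-Perl sketch. It is not C<Sub::Versive>'s actual interface: the helper C<add_pre_hook> and the C<greet> subroutine are made up for the example, and the sketch assumes the target subroutine already exists when the hook is installed.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT Sub::Versive's API): install a pre-hook on a
# named subroutine by replacing its symbol-table entry with a wrapper.
sub add_pre_hook {
    my ($name, $hook) = @_;
    no strict 'refs';
    no warnings 'redefine';
    my $original = \&{$name};
    *{$name} = sub {
        $hook->(@_);          # the hook runs first and sees the same arguments
        goto &$original;      # then control passes to the original subroutine
    };
    return;
}

my @log;
sub greet { return "Hello, $_[0]" }

add_pre_hook('main::greet', sub { push @log, "greet called with @_" });

my $result = greet('world');
print "$result\n";
print "$_\n" for @log;
```

Because the wrapper hands over with C<goto>, the original subroutine sees the caller's own context and argument list, which matters if the thing being wrapped is an C<AUTOLOAD>.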
L<Class::Dynamic> was an interesting attempt to provide support for code
references in C<@ISA>, analogous to code references in C<@INC>. It
works, but of course I could never find any practical use for it.
L<Class::Wrap> was written as a lazy profiler. A certain application I
was writing for my employer of the time, Kasei, made use of the (IMHO
evil) C<Mail::Message> module. How do we isolate all calls to that
class? There are plenty of modules out there for instrumenting
individual methods, including of course C<Sub::Versive>. But the whole
class? C<Class::Wrap> takes a wonderfully brute-force but workable
approach to the problem. A real profiler, however, can be constructed
from L<Devel::DProfPP>, which is sort of a profiler toolkit.
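One brute-force way to instrument a whole class is to walk its symbol table and wrap every subroutine found there. The sketch below shows the idea only; it is not taken from C<Class::Wrap>, and C<count_calls_in> and C<Counter::Demo> are hypothetical names. It assumes every stash entry of interest is an ordinary subroutine and ignores inherited methods.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every subroutine defined directly in a package
# so that each call is counted in the supplied hash.
sub count_calls_in {
    my ($class, $counts) = @_;
    no strict 'refs';
    no warnings 'redefine';
    for my $name (keys %{"${class}::"}) {
        my $full = "${class}::$name";
        next unless defined &{$full};   # skip entries that are not subroutines
        my $original = \&{$full};
        *{$full} = sub {
            $counts->{$name}++;         # record the call
            goto &$original;            # then run the real method
        };
    }
    return;
}

package Counter::Demo;                  # hypothetical class to instrument
sub new   { return bless { n => 0 }, shift }
sub inc   { $_[0]{n}++; return $_[0] }
sub value { return $_[0]{n} }

package main;

my %calls;
count_calls_in('Counter::Demo', \%calls);

my $c = Counter::Demo->new;
$c->inc->inc;
my $value = $c->value;

print "$_: $calls{$_}\n" for sort keys %calls;
```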
I wrote a couple of other modules with Kasei in this category,
particularly while working on our Plucene port of the Lucene search
engine. (I guess I could claim C<Plucene> as one of my 100 modules, but
that would be to deny Marc Kerr the recognition he deserves for the work
he put in to packaging, documenting and providing tests for my insane
and scrambled code.) I wrote L<Bit::Vector::Minimal>, for instance, as I
ported C<org.apache.lucene.util.BitVector>; L<Tie::Array::Sorted>, which
I'm amazed wasn't already implemented on CPAN, provided the Perl
equivalent of C<org.apache.lucene.util.PriorityQueue>.
L<Lucene::QueryParser>, of course, does what it says on the tin. (I also
produced a couple of add-ons for Plucene after leaving Kasei when I was
doing a bit of Plucene consultancy:
L<Plucene::Plugin::Analyzer::PorterAnalyzer> and
L<Plucene::Plugin::WeightedQueryParser>.)
Another module produced in the course of writing Plucene was
L<Class::HasA>, a handy little utility module which works well with Tony
Bowden's C<Class::Accessor> and merely dispatches certain method calls
to objects contained within your object.
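The delegation idea can be sketched in a few lines of plain Perl. This is an illustration of forwarding methods in general, not C<Class::HasA>'s interface; the C<delegate> class method and the C<Car> and C<Engine> classes are invented for the example, and it assumes objects are hash-based.

```perl
use strict;
use warnings;

package Engine;                     # hypothetical contained class
sub new   { return bless { running => 0 }, shift }
sub start { $_[0]{running} = 1; return 'vroom' }

package Car;                        # hypothetical containing class
sub new { return bless { engine => Engine->new }, shift }

# Generate forwarding methods: each named method is passed straight on
# to the object held in the given hash slot.
sub delegate {
    my ($class, $slot, @methods) = @_;
    no strict 'refs';
    for my $method (@methods) {
        *{"${class}::$method"} = sub {
            my $self = shift;
            return $self->{$slot}->$method(@_);
        };
    }
    return;
}

Car->delegate(engine => 'start');

package main;

my $car   = Car->new;
my $noise = $car->start;            # handled by the Engine object
print "$noise\n";
```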
And speaking of C<Class::Accessor>, L<Class::Accessor::Assert> would
have been a godsend while writing Plucene, as it's a version of accessor
handling which typechecks what you're putting into the accessor slots.
When you're converting a typed language into an untyped one, occasional
checks that you're handling the right kind of object don't go amiss. I
learnt my lesson eventually, though, and wrote the module after Plucene
was done.
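For a flavour of what a typechecking accessor does, here is a minimal sketch. It does not reproduce C<Class::Accessor::Assert>'s declaration syntax; C<make_checked_accessor> and the C<Term> and C<Query> classes are hypothetical, and the check shown is simply "is this an object of the expected class or a subclass of it".

```perl
use strict;
use warnings;
use Carp qw(croak);
use Scalar::Util qw(blessed);

# Hypothetical helper: build an accessor for $field in $class that only
# accepts objects of class $type (or a subclass of it).
sub make_checked_accessor {
    my ($class, $field, $type) = @_;
    no strict 'refs';
    *{"${class}::$field"} = sub {
        my $self = shift;
        if (@_) {
            my $value = shift;
            croak "$field must be a $type object"
                unless blessed($value) && $value->isa($type);
            $self->{$field} = $value;
        }
        return $self->{$field};
    };
    return;
}

package Term;  sub new { return bless {}, shift }   # hypothetical classes
package Query; sub new { return bless {}, shift }

package main;

make_checked_accessor('Query', 'term', 'Term');

my $query = Query->new;
$query->term(Term->new);                            # accepted
my $stored = ref $query->term;

# Storing anything that is not a Term object dies with a clear message.
my $error = eval { $query->term('a string'); 1 } ? '' : $@;
print "stored a $stored; rejected: $error";
```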
Another Java-influenced module was C<Attribute::Final>, which was written
for my book Advanced Perl Programming as an example of both attributes
and messing about with the class model - by marking some subroutines as
C<:final>, you get an error if a derived class attempts to override them.
As with many of my proof-of-concept modules, this isn't something I'd
ever use myself, but I know others have used it. I'll let you into a
secret - over the past few months I've settled on giving modules a
version number of C<0.x> if I've never used them myself and C<1.x> if I
have.
Java wasn't the only language to influence my Perl coding activities.
Ruby is a wonderful little language I first encountered in Japan, but
didn't really get into until around 2003. Of course, when you see
another language has some good ideas, you steal them, which is what I
did with L<rubyisms>, L<SUPER>, and L<Class::SingletonMethod> - all of
which, by the way, are B<excellent> examples of what you can do to the
behaviour of Perl just from pure Perl. C<SUPER> is the kind of module
I've so often wanted to use in production code but never dared.

=head2 Smart Perl

My views on human-computer interface and computer usability have been
unchanged since I wrote C<Tie::DiscoveryHash> way back in the mists of
time. The underlying principle behind that module was simple: the user
should B<never> tell the computer anything it already knows or can
reasonably be expected to work out. C<Tie::DiscoveryHash> was all about
having the computer find out stuff for itself.
This has influenced a number of my modules, which have focussed on
trying to make everything as simple as possible for the user (or more
usually, for the programmer using my modules) and then a bit simpler.
So, for instance, I found the whole process of keeping values persistent
between runs of Perl a bit of a nightmare - I could never remember the
syntax for tying to C<DB_File>, and I would always forget to use the
extremely handy C<MLDBM> module. I just wanted to say "keep this
variable around". L<Attribute::Persistent> does just that, cleanly and
simply. It even works out a sensible place to put the database, so you
don't have to.
Similarly, L<Config::Auto> works out where your application might keep a
configuration file, works out what format it's in, parses it, and hands
you back a hash. No muss, no fuss. And more importantly, no need to even
think about writing a config file parser again. It's done once, forever.
L<Getopt::Auto> applies the same design principles to handling command
line arguments - I hate forgetting how to use C<Getopt::Long>.
Other attempts at making things simple for the end-user weren't that
successful. As part of writing my (first) mail archiving and indexing
program, C<Mail::Miner>, of which more later, I wanted a nice way for
users to specify a time period in which they're looking for mails - "a
week ago", "sometime last summer", "near the beginning of last month" -
and so on. L<Date::PeriodParser> would take these descriptions and turn
them into a start and end time in which to search. Except, of course,
that this is a very hard thing to do and requires a lot of heuristics,
and while I started off quite well, as ever, I got distracted with other
interesting and considerably more tractable problems.

=head2 Mail Handling

A good number of my Perl modules focussed on mail handling, so many that
I was actually able to get a job basically doing mail processing in
Perl. It all started with L<Mail::Audit>. I was introduced to
F<procmail> at University, and it was useful enough, but it kept having
locking problems and losing my mail, and I didn't really understand it,
to be honest, so I wanted to write my mail filtering rules in Perl.
C<Mail::Audit> worked well for a couple of years before it grew into an
As part of the attempt to slim it back down again, I abstracted out one
of the major parts of its functionality, delivering an email to a local
mailbox. Now I only use mbox files, so it was reasonably easy for me,
but people wanted me to add Maildir and whatever to C<Mail::Audit>, so I
kicked it all out to L<Mail::LocalDelivery> instead.
But I found that I still wasn't able to filter my mail adequately and
find the stuff I needed from it. Attachments were a big problem, since
they both made ordinary search with C<grep> or C<grepmail> much slower,
and they weren't always easy to find anyway. So I wrote something to
remove attachments from mail and stick them in a database, and while I'm
at it, index mail for quick retrieval. And then it grew to identifying
"interesting" features of an email and searching for them too, and then
L<Mail::Miner> was born.
Finally, I got into web display of archived email, and needed a way of
displaying threads. Amazingly, nobody had coded up JWZ's mail threading
algorithm in Perl yet, so I did that too: L<Mail::Thread>.
But then I decided that C<Mail::*> was in a very sick state. I had been
working with the mail handling modules from CPAN - including my own -
and grown to hate them; they were all too slow, too complicated, too
buggy or all three. It was time for action, and the Perl Email Project
was born.
L<Email::Simple> was the first thing to come out of this, and is
a fantastic way of just getting at the bits you need from an email. It's
much simpler, and therefore much faster, than its more fully-featured
cousins on CPAN. L<Email::MIME> was its natural successor, which added
rudimentary MIME handling, and spawned two subsidiary modules,
L<Email::MIME::ContentType> and L<Email::MIME::Encodings> in order to
keep C<Email::MIME> itself focussed on the "do one thing and do it well"
principle.
Of course we then had to replace C<Mail::Audit>, so
L<Email::LocalDelivery> and L<Email::Filter> appeared. This is another
module I don't use, because my C<Mail::Audit> setup works and I'm
terrified of breaking it and losing all my mail. But I'm told that
C<Email::Filter> works just fine too.
By this stage, C<Mail::Miner> was getting crufty. It was replaced by a much
more modular and beautiful L<Email::Store>; this is extended with
plug-in modules like L<Email::Store::Summary>, L<Email::Store::Plucene>
and L<Email::Store::Thread>. I had to write the plug-in framework
myself, since neither C<Module::Pluggable> nor C<Class::Trigger> did
quite what I wanted, and so the C<Email::Store> project also produced
L<Module::Pluggable::Ordered>.
Now C<Email::Store> naturally uses C<Email::Simple> objects, since
it's the most efficient mail representation class on CPAN.
Unfortunately, C<Email::Store> also wants to make use of some modules on
CPAN like C<Mail::ListDetector> which don't want to know about
C<Email::Simple> objects and want to talk C<Mail::Internet> or whatever.
To get around this, I wrote L<Email::Abstract> which provides module
writers with an interface to B<any> kind of mail object, so they don't
have to force a particular representation on their users.

=head2 Linguistics

I'm actually a linguist by training, not a computer programmer,
graduating from the school of Oriental Studies with second and third
year options in Japanese linguistics. I'd like to think that my work at
Kasei was as much about linguistic and textual analysis as it was about
mail munging. With that in mind, I wrote a few language-related modules
during my time with them.
The first important module, which I started work on while I was playing
with C<Mail::Miner>, was L<Lingua::EN::Keywords>. This started life as a
relatively naive algorithm for picking common words out of a text in an
attempt to provide some keywords to describe what the text is "about", and
has matured into quite a handy little automatic topic recognition
module. Its natural counterpart is L<Lingua::EN::NamedEntity>, which
B<is> still a naive algorithm but sometimes those are the best ones.
This module has a bit of a story behind it. While analysing mails we were
trying to find people, places, times, and other things we could link
together into a knowledge base. The technical term for this is named
entity extraction. I found a useful library to do this, called C<GATE>.
It's written in Java, which meant using C<Inline::Java>, and is
extremely slow and complex. At the same time, I was writing a chapter on
computational linguistics with Perl in Advanced Perl Programming, and
wanted to talk about named entity extraction. Unfortunately, I only had
one module which did this, L<GATE::ANNIE::Simple>, and it was a hack. If
you're going to talk about a subject, it makes sense to compare and
contrast different solutions, and Tony had already been saying "why
don't you just write something to pull out capitalized phrases, for
starters?" I did this, intending to use it as a baseline, but of course
it's much faster than C<GATE> and not noticeably less accurate. Ho hum.
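The "pull out capitalized phrases" baseline is easy to sketch with a single regular expression. This is an illustration of the idea only and is not the code of C<Lingua::EN::NamedEntity>; the C<capitalized_phrases> helper is hypothetical and assumes plain ASCII English text.

```perl
use strict;
use warnings;

# Baseline sketch: treat any run of capitalized words as a candidate
# named entity. Hypothetical helper, not taken from any CPAN module.
sub capitalized_phrases {
    my ($text) = @_;
    my @phrases = $text =~ /\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\b/g;
    return @phrases;
}

my @found = capitalized_phrases(
    'we met Simon Cozens and Tony Bowden in Oxford last week.'
);
print "$_\n" for @found;
```

A sentence-initial word is capitalized too, so a real baseline would need some way of discounting the first word of each sentence; the sketch makes no attempt at that.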
Another thing those wacky computational linguists do a lot of is
working with n-gram streams. In every discipline, there's a particular
hammer you can use to solve any given problem. In data mining, it's
called market basket analysis. In computational linguistics, it's
maximal entropy. You look at the past stream of n characters (that's an
n-gram) and work out how hard it is to see what's coming next.
For instance, if I feed you the 4-gram C<xylo> the chances of a C<p>
next are very high. The chances of an C<e>, or indeed anything else, are
pretty low. Low entropy area. But if I feed you C<then>, it's really not
easy to guess the next letter, since we're likely to be at the end of a
word and the next word might be anything; high entropy. That's how you
use maximal entropy to find word breaks in unsegmented text, and there's
a huge amount of other cool stuff you can do with it.
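The C<xylo> versus C<then> contrast can be made concrete with a small sketch that counts which character follows each n-gram and computes the Shannon entropy of that distribution. The helper names and the toy training string are assumptions made for the example; a real segmenter would need far more text and some smoothing.

```perl
use strict;
use warnings;

# Count which character follows each n-gram in a training text.
sub next_char_counts {
    my ($text, $n) = @_;
    my %counts;
    for my $i (0 .. length($text) - $n - 1) {
        my $gram = substr $text, $i, $n;
        my $next = substr $text, $i + $n, 1;
        $counts{$gram}{$next}++;
    }
    return \%counts;
}

# Shannon entropy, in bits, of the next-character distribution for a gram.
# Returns nothing if the gram was never seen.
sub entropy {
    my ($counts, $gram) = @_;
    my $followers = $counts->{$gram} or return;
    my $total = 0;
    $total += $_ for values %$followers;
    my $h = 0;
    for my $count (values %$followers) {
        my $p = $count / $total;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

# Toy unsegmented text: "xylo" is always followed by "p", while "then"
# is followed by a different letter each time.
my $text   = 'xylophonexylophonethenathenbthenc';
my $counts = next_char_counts($text, 4);
my $low    = entropy($counts, 'xylo');
my $high   = entropy($counts, 'then');

printf "xylo: %.3f bits, then: %.3f bits\n", $low, $high;
```

A jump in entropy after an n-gram is then a hint that a word boundary may follow it.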
I swear the day I wrote L<Text::Ngram>, there were no other modules on
CPAN which extracted n-grams, but as soon as I released it, it looked
like there were three or four there all along. (Including one from
Jarkko, no less.) Anyway, I wanted to see if I could still remember how
to write XS modules, especially since I'd just written a book about it.
L<Lingua::EN::Inflect::Number> is a terrible hack, but it works. I
needed it to make C<Class::DBI::Relationship> (of which more later)
more human-friendly. L<Lingua::EN::FindNumber> is another hack written
for APP; I was a little surprised that C<Lingua::EN::Words2Nums>, which
is a fantastic module in its own right, can turn English descriptions of
numbers into digits, but it can't actually pull the numbers out of a
text in the first place. So I fixed that.

=head2 Text Munging, and Some More Mail Stuff

Applying my linguistic experience to the problems of intelligent mail