Acme-OneHundredNotOut

 view release on metacpan or  search on metacpan

OneHundredNotOut.pm  view on Meta::CPAN

automation of everyday life: checking my bank balance with
L<Finance::Bank::LloydsTSB> (the first Perl module to interface to
personal internet banking, no less) and my phone bill with a release of
Tony Bowden's L<Data::BT::PhoneBill>. 

L<Finance::Bank::LloydsTSB> was meant to go with L<Finance::QIF>, my
Quicken file parser, to produce another now-abandoned idea, a Perl
finances manager. It seemed that I'm only capable of producing modules,
not full standalone applications - or at least, it seemed that way until
I produced L<Bryar>, my blogging software, based on the concepts from
Rael Dornfest's C<blosxom> and beginning my adventures with Andy
Wardley's Template Toolkit. Bryar also tuned me in to the
Model-View-Controller framework idea, of which more later.

Another project I briefly played with was a personal robot, using the
C<Sphinx>/C<Festival> speech handling and recognition modules from
Cepstral and Kevin Lenzo. I didn't have X10, so I couldn't shout
"lights" into the air in a wonderfully scifi way, but I could shout
"mail" and have a summary of my inbox read to me, "news" to get the
latest BBC news headlines, and "time" to hear the time. Of course,
getting computers to tell the time nicely takes a little bit of work. I
don't like "It's eleven oh-three pee em", since that's not what someone
would say if you asked them the time. I wanted my robot to say "It's
just after eleven", and that's what L<Time::Human> does. Shame about the
localisation.

=head2 Messing About With Classes

One of the things that continues to amaze me about Perl is its
flexibility; the way you can change core parts of its operation, even
from pure Perl. This lead to quite a few modules, many of which were
mere proofs of concept.

L<Sub::Versive>, for instance, was the first module on CPAN to handle
pre- and post-hooks for a subroutine; it has since been joined by a
plethora of imitators. It was written, though, in response to a peculiar
scenario. I was writing a module (C<Safety::First>) which provided
additional built-in-like functions for Perl to encourage and facilitate
defensive programming and intelligible error reporting. ("Couldn't open
file? Why not?") These built-ins had to be available from every
package, which meant playing with C<UNIVERSAL::AUTOLOAD>. But what if
another package was already using C<UNIVERSAL::AUTOLOAD>? Hence,
C<Sub::Versive> wrapped it in a pre-hook. Of course, with the
interesting bit of the problem solved, C<Safety::First> was abandoned.

L<Class::Dynamic> was an interesting attempt to provide support for code
references in C<@ISA>, analogous to code references in C<@INC>. It
works, but of course I could never find any practical use for it.

L<Class::Wrap> was written as a lazy profiler. A certain application I
was writing for my employer of the time, Kasei, made use of the (IMHO
evil) C<Mail::Message> module. How do we isolate all calls to that
class? There are plenty of modules out there for instrumenting
individual methods, including of course C<Sub::Versive>. But the whole
class? C<Class::Wrap> takes a wonderfully brute-force but workable
approach to the problem. A real profiler, however, can be constructed
from L<Devel::DProfPP>, which is sort of a profiler toolkit.

I wrote a couple of other modules with Kasei in this category,
particularly while working on our Plucene port of the Lucene search
engine. (I guess I could claim C<Plucene> as one of my 100 modules, but
that would be to deny Marc Kerr the recognition he deserves for the work
he put in to packaging, documenting and providing tests for my insane
and scrambled code.) I wrote L<Bit::Vector::Minimal>, for instance, as I
ported C<org.apache.lucene.util.BitVector>; L<Tie::Array::Stored>, which
I'm amazed wasn't already implemented on CPAN, provided the Perl
equivalent of C<org.apache.lucene.util.PriorityQueue>.
L<Lucene::QueryParser>, of course, does what it says on the tin. (I also
produced a couple of add-ons for Plucene after leaving Kasei when I was
doing a bit of Plucene consultancy:
L<Plucene::Plugin::Analyzer::PorterAnalyzer> and
L<Plucene::Plugin::WeightedQueryParser>.)

Another module produced in the course of writing Plucene was
L<Class::HasA>, a handy little utility module which works well with Tony
Bowden's C<Class::Accessor> and merely dispatches certain method calls
to objects contained within your object.

And speaking of C<Class::Accessor>, L<Class::Accessor::Assert> would
have been a godsend while writing Plucene, as it's a version of accessor
handling which typechecks what you're putting into the accessor slots.
When you're converting a typed language into an untyped one, occasional
checks that you're handling the right kind of object don't go amiss. I
learnt my lesson eventually, though, and wrote the module after Plucene
was done.

Another Java-influenced module was C<Attribute::Final>, which was written 
for my book Advanced Perl Programming as an example of both attributes
and messing about with the class module - by marking some subtourines as
C<:final>, you get an error if a derived class attempts to override it.
As with many of my proof-of-concept modules, this isn't something I'd
ever use myself, but I know others have used it. I'll let you into a
secret - over the past few months I've settled on giving modules a
version number of C<0.x> if I've never used them myself and C<1.x> if I
have.

Java wasn't the only language to influence my Perl coding activities.
Ruby is a wonderful little language I first encountered in Japan, but
didn't really get into until around 2003. Of course, when you see
another language has dome good ideas, you steal them, which is what I
did with L<rubyisms>, L<SUPER>, and L<Class::SingletonMethod> - all of
which, by the way, are B<excellent> examples of what you can do to the
behaviour of Perl just from pure Perl. C<SUPER> is the kind of module
I've so often wanted to use in production code but never dared.

=head2 Smart Perl

My views on human-computer interface and computer usability have been
unchanged since I wrote C<Tie::DiscoveryHash> way back in the mists of
time. The underlying principle behind that module was simple: the user
should B<never> tell the computer anything it already knows or can
reasonably be expected to work out. C<Tie::DiscoveryHash> was all about
having the computer find out stuff for itself.

This has influenced a number of my modules, which have focussed on
trying to make everything as simple as possible for the user (or more
usually, for the programmer using my modules) and then a bit simpler.

So, for instance, I found the whole process of keeping values persistent
between runs of Perl a bit of a nightmare - I could never remember the
syntax for tying to C<DB_File>, and I would always forget to use the

OneHundredNotOut.pm  view on Meta::CPAN


By this stage, C<Mail::Miner> was getting crufty. It was replaced by a much
more modular and beautiful L<Email::Store>; this is extended with
plug-in modules like L<Email::Store::Summary>, L<Email::Store::Plucene>
and L<Email::Store::Thread>. I had to write the plug-in framework
myself, since neither C<Module::Pluggable> or C<Class::Trigger> did
quite what I wanted, and so the C<Email::Store> project also produced
L<Module::Pluggable::Ordered>.

Now C<Email::Store> naturally uses C<Email::Simple> objects, since
it's the most efficient mail representation class on CPAN.
Unfortunately, C<Email::Store> also wants to make use of some modules on
CPAN like C<Mail::ListDetector> which don't want to know about
C<Email::Simple> objects and want to talk C<Mail::Internet> or whatever.
To get around this, I wrote L<Email::Abstract> which provides module
writers with an interface to B<any> kind of mail object, so they don't
have to force a particular representation on their users. 

=head2 Linguistics

I'm actually a linguist by training, not a computer programmer,
graduating from the school of Oriental Studies with second and third
year options in Japanese linguistics. I'd like to think that my work at
Kasei was as much about linguistic and textual analysis as it was about
mail munging. With that in mind, I wrote a few language-related modules
during my time with them.

The first important module, which I started work on while I was playing
with C<Mail::Miner>, was L<Lingua::EN::Keywords>. This started life as a
relatively naive algorithm for picking common words out of a text in an
attempt to provide some keywords to describe what the text is "about", and
has matured into quite a handy little automatic topic recognition
module. Its natural counterpart is L<Lingua::EN::NamedEntity>, which
B<is> still a naive algorithm but sometimes those are the best ones.

This module has a bit of story behind it. While analysing mails we were
trying to find people, places, times, and other things we could link
together into a knowledge base. The technical term for this is named
entity extraction. I find a useful library to do this, called C<GATE>.
It's written in Java, which meant using C<Inline::Java>, and is
extremely slow and complex. At the same time, I was writing a chapter on
computational linguistics with Perl in Advanced Perl Programming, and
wanted to talk about named entity extraction. Unfortunately, I only had
one module which did this, L<GATE::ANNIE::Simple>, and it was a hack. If
you're going to talk about a subject, it makes sense to compare and
contrast different solutions, and Tony had already been saying "why
don't you just write something to pull out capitalized phrases, for
starters?" I did this, intending to use it as a baseline, but of course
it's much faster than C<GATE> and not noticably less accurate. Ho hum.

Another thing those wacky computational linguists do a lot of is
working with n-gram streams. In every discipline, there's a particular
hammer you can use to solve any given problem. In data mining, it's
called market basket analysis. In computational linguistics, it's
maximal entropy. You look at the past stream of n characters (that's an
n-gram) and work out how hard it is to see what's coming next. 

For instance, if I feed you the 4-gram C<xylo> the chances of a C<p>
next are very high. The chances of a C<e>, or indeed anything else, are
pretty low. Low entropy area. But if I feed you C<then>, it's really not
easy to guess the next letter, since we're likely to be at the end of a
word and the next word might be anything; high entropy. That's how you
use maximal entropy to find word breaks in unsegmented text, and there's
a huge amount of other cool stuff you can do with it. 

I swear the day I wrote L<Text::Ngram>, there were no other modules on
CPAN which extracted n-grams, but as soon as I released it it looked
like there were three or four there all along. (Including one from
Jarkko, no less.) Anyway, I wanted to see if I could still remember how
to write XS modules, especially since I'd just written a book about it.

L<Lingua::EN::Inflect::Number> is a terrible hack, but it works. I
needed it to make C<Class::DBI::Relationship> (of which more later)
more human-friendly. L<Lingua::EN::FindNumber> is another hack written
for APP; I was a little surprised that C<Lingua::EN::Words2Nums>, which
is a fantastic module in its own right, can turn English descriptions of
numbers into digits, but it can't actually pull the numbers out of a
text in the first place. So I fixed that.

=head2 Text Munging, and Some More Mail Stuff

Applying my linguistic experience to the problems of intelligent mail
indexing, searching and displaying led to churning out another set of
modules.

The first problem was what to do with search results. You know those
little snippets that Google and other search engines display when you
search for some terms? They contextualise the terms in the body of the
document and highlight them in a snippet that best represents how
they're used in the document. This is actually a really hard problem,
and it took me several goes to get L<Text::Context> right. It uses
L<Text::Context::EitherSide> as an "emergency" contextualizer if it
can't get anything right at all, but the algorithm itself is a bit of a
swine. I actually had to prototype this module in Ruby to get my
thinking clear enough to code it up in Perl...

L<Text::Quoted> was another mail display problem - it's nice to
display different layers of quoted text in an email in different
colours. Identifying the quoted text isn't that hard, but working out
a particular bit nests is also surprisingly tricky. So I sorted it out.

The next problem I had to solve lead on from this. Suppose you've got
some mail, which is plain text, and you're going to display it as HTML.
Along the way, you want to turn any URIs into links, (maybe using
something like L<URI::Find::Schemeless::Stricter> to find things which
look like URLs, but which doesn't think that numbered lists are IP
addresses) escape any non-HTML-safe characters, highlight search terms,
put different quoted regions in different colours, and maybe do other
things too. The thing is, you have to be very careful about the order in
which you do this. Once you've escaped the HTML, you might mess up your
colouring of quoted text, but if you've turned the URIs into links
first, you'll mess them up when you escape all the HTML entities.
L<Text::Decorator> allows you to do all these transformations in a nice,
safe way, "layering" things like URI escaping, highlighting, and so on,
and then rendering to text or HTML or whatever when all the layers have
been applied.

C<Text::Decorator> was written in a meta-programming system I wrote
called L<pool>, which I should probably use more. It writes the boring
bit of OO classes for you given a simple description of the methods and
attributes.



( run in 1.796 second using v1.01-cache-2.11-cpan-e1769b4cff6 )