Acme-OneHundredNotOut


"interesting" features of an email and searching for them too, and then
L<Mail::Miner> was born.

Finally, I got into web display of archived email, and needed a way of
displaying threads. Amazingly, nobody had coded up JWZ's mail threading
algorithm in Perl yet, so I did that too: L<Mail::Thread>.

But then I decided that C<Mail::*> was in a very sick state. I had been
working with the mail handling modules from CPAN - including my own -
and grown to hate them; they were all too slow, too complicated, too
buggy or all three. It was time for action, and the Perl Email Project
was born. 

L<Email::Simple> was the first thing to come out of this, and is 
a fantastic way of just getting at the bits you need from an email. It's
much simpler, and therefore much faster, than its more fully-featured
cousins on CPAN. L<Email::MIME> was its natural successor, which added
rudimentary MIME handling, and spawned two subsidiary modules,
L<Email::MIME::ContentType> and L<Email::MIME::Encodings> in order to
keep C<Email::MIME> itself focussed on the "do one thing and do it well"
principle.

Of course we then had to replace C<Mail::Audit>, so
L<Email::LocalDelivery> and L<Email::Filter> appeared. This is another
module I don't use, because my C<Mail::Audit> setup works and I'm
terrified of breaking it and losing all my mail. But I'm told that
C<Email::Filter> works just fine too.

By this stage, C<Mail::Miner> was getting crufty. It was replaced by a much
more modular and beautiful L<Email::Store>; this is extended with
plug-in modules like L<Email::Store::Summary>, L<Email::Store::Plucene>
and L<Email::Store::Thread>. I had to write the plug-in framework
myself, since neither C<Module::Pluggable> nor C<Class::Trigger> did
quite what I wanted, and so the C<Email::Store> project also produced
L<Module::Pluggable::Ordered>.

Now C<Email::Store> naturally uses C<Email::Simple> objects, since
it's the most efficient mail representation class on CPAN.
Unfortunately, C<Email::Store> also wants to make use of some modules on
CPAN like C<Mail::ListDetector> which don't want to know about
C<Email::Simple> objects and want to talk C<Mail::Internet> or whatever.
To get around this, I wrote L<Email::Abstract> which provides module
writers with an interface to B<any> kind of mail object, so they don't
have to force a particular representation on their users. 

=head2 Linguistics

I'm actually a linguist by training, not a computer programmer,
graduating from the school of Oriental Studies with second and third
year options in Japanese linguistics. I'd like to think that my work at
Kasei was as much about linguistic and textual analysis as it was about
mail munging. With that in mind, I wrote a few language-related modules
during my time with them.

The first important module, which I started work on while I was playing
with C<Mail::Miner>, was L<Lingua::EN::Keywords>. This started life as a
relatively naive algorithm for picking common words out of a text in an
attempt to provide some keywords to describe what the text is "about", and
has matured into quite a handy little automatic topic recognition
module. Its natural counterpart is L<Lingua::EN::NamedEntity>, which
B<is> still a naive algorithm but sometimes those are the best ones.

This module has a bit of a story behind it. While analysing mails we were
trying to find people, places, times, and other things we could link
together into a knowledge base. The technical term for this is named
entity extraction. I found a useful library to do this, called C<GATE>.
It's written in Java, which meant using C<Inline::Java>, and is
extremely slow and complex. At the same time, I was writing a chapter on
computational linguistics with Perl in Advanced Perl Programming, and
wanted to talk about named entity extraction. Unfortunately, I only had
one module which did this, L<GATE::ANNIE::Simple>, and it was a hack. If
you're going to talk about a subject, it makes sense to compare and
contrast different solutions, and Tony had already been saying "why
don't you just write something to pull out capitalized phrases, for
starters?" I did this, intending to use it as a baseline, but of course
it's much faster than C<GATE> and not noticeably less accurate. Ho hum.
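The "pull out capitalized phrases" baseline is easy to sketch in a few
lines of plain Perl. This is just the naive idea, not the actual
C<Lingua::EN::NamedEntity> interface, and the sample sentence is made up
for the example:

```perl
#!/usr/bin/perl
# Naive named entity baseline: pull out runs of capitalised words.
# An illustration of the idea only, not the Lingua::EN::NamedEntity API.
use strict;
use warnings;

sub capitalised_phrases {
    my ($text) = @_;
    # One or more adjacent capitalised words counts as a candidate entity.
    return $text =~ /\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\b/g;
}

my @entities = capitalised_phrases(
    "While at Kasei, Simon Cozens and Tony Bowden analysed mail archives."
);
# @entities is now ("While", "Kasei", "Simon Cozens", "Tony Bowden") -
# the false positive on the sentence-initial word shows why this is
# only a baseline.
```

Crude, but as baselines go it's surprisingly hard to beat.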

Another thing those wacky computational linguists do a lot of is
working with n-gram streams. In every discipline, there's a particular
hammer you can use to solve any given problem. In data mining, it's
called market basket analysis. In computational linguistics, it's
maximal entropy. You look at the past stream of n characters (that's an
n-gram) and work out how hard it is to see what's coming next. 

For instance, if I feed you the 4-gram C<xylo> the chances of a C<p>
next are very high. The chances of an C<e>, or indeed anything else, are
pretty low. Low entropy area. But if I feed you C<then>, it's really not
easy to guess the next letter, since we're likely to be at the end of a
word and the next word might be anything; high entropy. That's how you
use maximal entropy to find word breaks in unsegmented text, and there's
a huge amount of other cool stuff you can do with it. 
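The idea can be sketched in a few lines of plain Perl. This is a toy
illustration, not the C<Text::Ngram> API: count n-grams in some text,
then measure the Shannon entropy of the next-character distribution
after a given 4-gram.

```perl
#!/usr/bin/perl
# Toy n-gram entropy sketch (not the Text::Ngram API).
use strict;
use warnings;

my $n    = 5;    # 4 characters of context, plus the character we predict
my $text = "xylophone then thence then xylophone";
my %next;        # context => { next_char => count }

for my $i (0 .. length($text) - $n) {
    my $context = substr($text, $i, $n - 1);
    my $char    = substr($text, $i + $n - 1, 1);
    $next{$context}{$char}++;
}

# Shannon entropy, in bits, of a hash of counts.
sub entropy {
    my ($dist) = @_;
    my $total = 0;
    $total += $_ for values %$dist;
    my $h = 0;
    for my $count (values %$dist) {
        my $p = $count / $total;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

# 'xylo' is always followed by 'p' here: zero entropy. 'then' can be
# followed by a space or by 'c' (in "thence"): higher entropy.
printf "H(next | 'xylo') = %.3f bits\n", entropy($next{'xylo'});
printf "H(next | 'then') = %.3f bits\n", entropy($next{'then'});
```

Run over real text with a bigger C<$n>, the entropy spikes line up
rather nicely with the word boundaries.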

I swear the day I wrote L<Text::Ngram>, there were no other modules on
CPAN which extracted n-grams, but as soon as I released it it looked
like there were three or four there all along. (Including one from
Jarkko, no less.) Anyway, I wanted to see if I could still remember how
to write XS modules, especially since I'd just written a book about it.

L<Lingua::EN::Inflect::Number> is a terrible hack, but it works. I
needed it to make C<Class::DBI::Relationship> (of which more later)
more human-friendly. L<Lingua::EN::FindNumber> is another hack written
for APP; I was a little surprised that C<Lingua::EN::Words2Nums>, which
is a fantastic module in its own right, can turn English descriptions of
numbers into digits, but it can't actually pull the numbers out of a
text in the first place. So I fixed that.

=head2 Text Munging, and Some More Mail Stuff

Applying my linguistic experience to the problems of intelligent mail
indexing, searching and displaying led to churning out another set of
modules.

The first problem was what to do with search results. You know those
little snippets that Google and other search engines display when you
search for some terms? They contextualise the terms in the body of the
document and highlight them in a snippet that best represents how
they're used in the document. This is actually a really hard problem,
and it took me several goes to get L<Text::Context> right. It uses
L<Text::Context::EitherSide> as an "emergency" contextualizer if it
can't get anything right at all, but the algorithm itself is a bit of a
swine. I actually had to prototype this module in Ruby to get my
thinking clear enough to code it up in Perl...

L<Text::Quoted> was another mail display problem - it's nice to
display different layers of quoted text in an email in different
colours. Identifying the quoted text isn't that hard, but working out
how a particular bit nests is also surprisingly tricky. So I sorted it out.

The next problem I had to solve led on from this. Suppose you've got
some mail, which is plain text, and you're going to display it as HTML.
Along the way, you want to turn any URIs into links, (maybe using
something like L<URI::Find::Schemeless::Stricter> to find things which
look like URLs, but which doesn't think that numbered lists are IP
addresses) escape any non-HTML-safe characters, highlight search terms,
put different quoted regions in different colours, and maybe do other
things too. The thing is, you have to be very careful about the order in
which you do this. Once you've escaped the HTML, you might mess up your
colouring of quoted text, but if you've turned the URIs into links
first, you'll mess them up when you escape all the HTML entities.
L<Text::Decorator> allows you to do all these transformations in a nice,
safe way, "layering" things like URI escaping, highlighting, and so on,
and then rendering to text or HTML or whatever when all the layers have
been applied.
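To make the ordering problem concrete, here's a minimal illustration in
plain Perl. The C<escape_html> and C<linkify> subs are hand-rolled
stand-ins for the example, nothing to do with the C<Text::Decorator>
internals:

```perl
#!/usr/bin/perl
# Why the order of transformations matters when rendering text as HTML.
# escape_html() and linkify() are hand-rolled for this example, not part
# of any of the modules mentioned above.
use strict;
use warnings;

sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must come first, or we double-escape
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

sub linkify {
    my ($s) = @_;
    $s =~ s{(https?://\S+)}{<a href="$1">$1</a>}g;
    return $s;
}

my $line = 'Fish & chips at http://example.com/menu today';

# Wrong order: the <a> tags we just added get escaped into visible text.
my $broken = escape_html(linkify($line));

# Right order: the & is safely escaped first, then the link survives.
my $ok = linkify(escape_html($line));
```

C<Text::Decorator>'s layering is what saves you from having to reason
through this ordering by hand every time you add a new transformation.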

C<Text::Decorator> was written in a meta-programming system I wrote
called L<pool>, which I should probably use more. It writes the boring
bit of OO classes for you given a simple description of the methods and
attributes.

Oh, and if you're not contextualising search terms in a mail snippet,
you probably just want to display the original content rather than the
first few lines, which invariably contain lots of quoting of another
message. L<Text::Original>, extracted from the code of the Mariachi
project and so actually only packaged by me and written by Richard Clamp
and Simon Wistow, does just this.

L<WWW::Hotmail> was an attempt to solve the problem of how to import all
the mail a user already has into our archiving program, a problem Gmail
is now dealing with. Actually, Gmail's currently dealing with pretty
much all the problems we looked at last year. It's quite funny, really.

=head2 SIMON Hits The Web

I hate web programming. HTML is boring, CGI is boring, and I tried
avoiding it for as long as I could. This stopped when I worked for
Oxford University, handling their webmail service, which led to
L<Bundle::WING>. Also at Oxford, I had to work with C<AxKit>, which
caused me innumerable headaches but I finally got some working XSP
applications written, not without writing the
L<Apache::AxKit::Language::XSP::ObjectTaglib> and
L<AxKit::XSP::Minisession> helper modules. I also did some playing
around with C<mod_perl>, thanks to the rather wonderful I<mod_perl
Cookbook>, and came up with L<Apache::OneTimeURL> when, during a
particularly paranoid phase, I wanted to give out my physical address
in URLs that would self-destruct after a single reading.

After leaving, though, I discovered the C<Class::DBI>/Template Toolkit
pair which has dominated my web programming since then. If you haven't
played with these two modules yet, you really need to, since they
work so well together, and with other modules like C<CGI::Untaint>, that 
they simplify so much of web and database work. I extended
C<CGI::Untaint> with a bunch of extra patterns while at Kasei and
afterwards, including L<CGI::Untaint::ipaddress>,
L<CGI::Untaint::upload> and L<CGI::Untaint::html>.
I also wrote a whole plethora of C<CDBI> extensions:
L<Class::DBI::AsForm>, L<Class::DBI::Plugin::Type>,
L<Class::DBI::Loader::GraphViz> (reflecting my penchant for data
visualization), and L<Class::DBI::Loader::Relationship>, which applies
the "as simple as possible and a bit simpler" approach to defining data
relationships.

The whole culmination of C<CDBI>, TT, and all these other technologies
came when I sat down and wrote L<Maypole>, a Model-View-Controller
framework with, again, emphasis on making things very simple to get
working. The Perl Foundation's sponsorship of Maypole development has
been one of the proudest achievements in my CPAN career, and led not
only to a stonking big manual and loads of examples, but also to
L<Maypole::Authentication::UserSessionCookie> and L<Maypole::Component>.

Template Toolkit and XML came back together again in a recent project
where I've had to render some XML as part of a Maypole application.
Amazingly, there wasn't an XSLT filter for the Template Toolkit, so
L<Template::Plugin::XSLT> was born.

=head2 Games, Diversions and Toys

It was only when I got back from Japan that I learnt to play Go. How
stupid was that. For a year I had access to some of the best Go clubs
and professional teachers and players in the world, and then I only picked
the bloody game up when I got back to England. Anyway, any computer
programmer who learns to play Go, and they all do sooner or later,
eventually decides to do something about the pitiful state of computer
Go. It's quite ridiculous that the game's been around for thousands of
years and the best computer programs we've devised regularly get beaten
resoundingly by small children. Anyway, I did my bit, producing
L<Games::Go::GMP> and L<Games::Go::SGF> as utility libraries, before
working on L<Games::Goban> to represent the state of the game.

But then while working for Kasei we discovered another addictive
diversion: poker. Computer poker isn't that great either, and I wanted
to write some robots to play on the internet poker servers;
L<Games::Poker::HandEvaluator> was the first product there, with the
hard work done by a GNU library, and L<Games::Poker::OPP> being the
interface to the network protocol. The comments to that module contain a
large number of Prisoner references, for no apparent reason. C<OPP>
needed a way of representing the state of a poker game, so I wrote
L<Games::Poker::TexasHold'em> to do that. And also because it was a
fantastic abuse of the C<'> package separator.

Oh, and another of my early modules that refused to die was
L<Oxford::Calendar>, which converts between the academic calendar and
the rest of the world's. It all counts, you know.

=head2 The Future

I've had mixed feelings on Perl 6, starting with my very public
nightmare at its announcement in 1999, (Hey, I'd just written a book on
Perl 5 internals, and now they're telling me it's obsolete.) and then my
very public repentance in 2000, at which point I was very excited about
the whole thing. So much so that I produced vast numbers of design
documents for the language, most of which are now ignored, but that's OK,
and set to work helping Dan design the core of the interpreter too. In
fact, I somehow managed to do so much work on it that, after a hacking
session together at O'Reilly in Boston in 2001, Dan let me be the
release pumpking of L<parrot>, a job I did until life got busy in 2002.
I'm extremely happy to have been involved in that, and hope I didn't
start the project off on too much of a bad footing. It looks to be doing
fine now, at least.

I was still interested in how they're going to make the Perl 6 parser
work, (I still am, but don't have enough time to throw at the problem)
and with my linguistic background I've always been interested in writing
parsers in general. So early on I started trying to write a
L<Perl6::Tokener>, which is now unfortunately quite obsolete, with the
intention of writing a parser later on. For most of 2002, my whiteboard
at home was covered with sketches of the Perl 6 grammar. 

Then I found out that the parser is actually going to be dynamic - you
can reconfigure the grammar at runtime. Hey, I thought, that's going to
be fun. At this point, you can't use an ordinary state-table parser like
C<yacc>, as Perl has done so far, because that pre-computes the
transitions up front. Instead, you have to use a proper state machine
without pre-computed tables. But I couldn't find any parsers which
worked on that basis, so I wrote one, L<shishi>, prototyping it in Perl
with L<Shishi::Prototype> first. 

This work has been largely ignored, unfortunately, but that's mainly
because I haven't had the time to do interesting user-facing stuff on top
of it so that it can be shown off. I tried porting C<Parse::RecDescent>
to it (using L<Parse::RecDescent::Deparse> to figure out what C<P::RD>
was doing) to produce a much faster recursive descent parser, but when I
heard that Damian Conway was funded to work on C<Parse::FastDescent> and
C<Parse::Perl>, (yes, I have a prototype of that too) I decided to leave
him to it. After all, why should I do the work and have other people get
