Benchmark-Perl-Formance-Cargo
share/SpamAssassin/easy_ham/01713.7e6c3f51ab4a45f60fbb0968d56f512c
> latter doesn't contain the same text as the text/html part (for
> example, as Anthony reported, perhaps the text/plain part just
> says something like "This is an HTML message.").
>
> If it's #2, it would be easy to add an optional bool argument to tokenize()
> meaning "even if it is pure HTML, strip the tags anyway". In fact, I'd like
> to do that and default it to True. The extreme hatred of HTML on tech lists
> strikes me as, umm, extreme <wink>.
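The optional flag proposed above could look roughly like the following. This is only a minimal sketch: the parameter name strip_html, the tag regex, and the whitespace splitting are assumptions made for illustration, not the project's actual tokenizer.

```python
import re

# Hypothetical pattern: anything between "<" and ">" counts as a tag.
TAG_RE = re.compile(r"<[^>]*>")


def tokenize(text, strip_html=True):
    """Split text into lowercase whitespace-delimited tokens.

    strip_html is an assumed name for the proposed optional bool;
    when true, tags are removed even if the message is pure HTML.
    """
    if strip_html:
        # Replace each tag with a space so adjacent words stay separate.
        text = TAG_RE.sub(" ", text)
    return text.lower().split()
```

With the default, tokenize("<td><b>Zope</b> sucks!</td>") yields only the two word tokens; with strip_html=False the markup stays glued to the words and ends up in the tokens.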
I also looked in more detail at some f-p's in my geeks traffic. The
first one's a doozie (that's the term, right? :-). It has lots of
HTML clues that are apparently ignored. It was a multipart/mixed with
two parts: a brief text/plain part containing one or two sentences, a
mondo weird URL:
http://x60.deja.com/[ST_rn=ps]/getdoc.xp?AN=687715863&CONTEXT=973121507.1408827441&hitnum=23
and some employer-generated spammish boilerplate; the second part was
the HTML taken directly from the above URL. Clues:
43 1.00 S '"main"': 0.01; '(later': 0.01; '(lots': 0.01; '--paul':
0.01; '1995-2000': 0.01; 'adopt': 0.01; 'apps': 0.01; 'commands':
0.01; 'deja.com': 0.01; 'dejanews,': 0.01; 'discipline': 0.01;
'duct': 0.01; 'email addr:digicool': 0.01; 'email name:paul':
0.01; 'everitt': 0.01; 'exist,': 0.01; 'forwards': 0.01;
'framework': 0.01; 'from:email addr:digicool': 0.01; 'from:email
name:<paul': 0.01; 'from:paul': 0.01; 'height': 0.01;
'hodge-podge': 0.01; 'http0:deja': 0.01; 'http0:zope': 0.01;
'http1:[st_rn': 0.01; 'http1:comp': 0.01; 'http1:getdoc': 0.01;
'http1:ps]': 0.01; 'http>1:22': 0.01; 'http>1:24': 0.01;
'http>1:57': 0.01; 'http>1:an': 0.01; 'http>1:author': 0.01;
'http>1:fmt': 0.01; 'http>1:getdoc': 0.01; 'http>1:pr': 0.01;
'http>1:products': 0.01; 'http>1:query': 0.01; 'http>1:search':
0.01; 'http>1:viewthread': 0.01; 'http>1:xp': 0.01; 'http>1:zope':
0.01; 'inventing': 0.01; 'jsp': 0.01; 'jsp.': 0.01; 'logic': 0.01;
'maps': 0.01; 'neo': 0.01; 'newsgroup,': 0.01; 'object': 0.01;
'popup': 0.01; 'probable': 0.01; 'query': 0.01; 'query,': 0.01;
'resizes': 0.01; 'servlet': 0.01; 'skip:? 20': 0.01; 'stems':
0.01; 'subject:JSP': 0.01; 'sucks!': 0.01; 'templating': 0.01;
'tempted': 0.01; 'url.': 0.01; 'usenet': 0.01; 'usenet,': 0.01;
'wrote': 0.01; 'x-mailer:mozilla 4.74 [en] (windows nt 5.0; u)':
0.01; 'zope': 0.01; '#000000;': 0.99; '#cc0000;': 0.99;
'#ff3300;': 0.99; '#ff6600;': 0.99; '#ffffff;': 0.99; '©':
0.99; '>': 0.99; ' ': 0.99; '"no': 0.99;
'.med': 0.99; '.small': 0.99; '0pt;': 0.99; '0px;': 0.99; '10px;':
0.99; '11pt;': 0.99; '12px;': 0.99; '18pt;': 0.99; '18px;': 0.99;
'1pt;': 0.99; '2px;': 0.99; '640;': 0.99; '8pt;': 0.99; '<!--':
0.99; '</b>': 0.99; '</body>': 0.99; '</head>': 0.99; '</html>':
0.99; '</script>': 0.99; '</select>': 0.99; '</span>': 0.99;
'</style>': 0.99; '</table>': 0.99; '</td>': 0.99; '</td></tr>':
0.99; '</tr>': 0.99; '</tr><tr': 0.99; '<b><a': 0.99; '<base':
0.99; '<body': 0.99; '<br>': 0.99; '<br> ': 0.99; '<br><a':
0.99; '<br><span': 0.99; '<font': 0.99; '<form': 0.99; '<head>':
0.99; '<html>': 0.99; '<img': 0.99; '<input': 0.99; '<meta': 0.99;
'<option': 0.99; '<p>': 0.99; '<p>a': 0.99; '<script>': 0.99;
'<select': 0.99; '<span': 0.99; '<style>': 0.99; '<table': 0.99;
'<td': 0.99; '<td>': 0.99; '<td></td>': 0.99; '<td><img': 0.99;
'<tr': 0.99; '<tr>': 0.99; '<tr><td': 0.99; '<tr><td><img': 0.99;
'absolute;': 0.99; 'align="left"': 0.99; 'align=center': 0.99;
'align=left': 0.99; 'align=middle': 0.99; 'align=right': 0.99;
'align=right>': 0.99; 'alt=""': 0.99; 'bold;': 0.99; 'border=0':
0.99; 'border=0>': 0.99; 'color:': 0.99; 'colspan=2': 0.99;
'colspan=2>': 0.99; 'colspan=4': 0.99; 'face="arial"': 0.99;
'font-family:': 0.99; 'font-size:': 0.99; 'font-weight:': 0.99;
'footer': 0.99; 'for<br>': 0.99; 'fucking<br>': 0.99;
'height="1"': 0.99; 'height="16"': 0.99; 'height=1': 0.99;
'height=12': 0.99; 'height=125': 0.99; 'height=17': 0.99;
'height=18': 0.99; 'height=21': 0.99; 'height=4': 0.99;
'height=57': 0.99; 'height=60': 0.99; 'height=8': 0.99;
'hspace=0': 0.99; 'http0:g': 0.99; 'http0:web2': 0.99; 'http1:0':
0.99; 'http1:ads': 0.99; 'http1:d': 0.99; 'http1:page': 0.99;
'http1:site': 0.99; 'http>1:article': 0.99; 'http>1:back': 0.99;
'http>1:com': 0.99; 'http>1:d': 0.99; 'http>1:gif': 0.99;
'http>1:go': 0.99; 'http>1:group': 0.99; 'http>1:http': 0.99;
'http>1:post': 0.99; 'http>1:ps': 0.99; 'http>1:site': 0.99;
'http>1:st': 0.99; 'http>1:title': 0.99; 'http>1:yahoo': 0.99;
'inc.</a>': 0.99; 'jobs!': 0.99; 'normal;': 0.99; 'nowrap': 0.99;
'nowrap>': 0.99; 'nowrap><font': 0.99; 'padding:': 0.99;
'rowspan=2': 0.99; 'rowspan=3': 0.99; 'servlets,': 0.99;
'size=15': 0.99; 'size=35': 0.99; 'skip:< 10': 0.99; 'skip:b 60':
0.99; 'skip:h 110': 0.99; 'skip:h 170': 0.99; 'skip:h 200': 0.99;
'skip:h 240': 0.99; 'skip:h 250': 0.99; 'skip:h 290': 0.99;
'skip:v 40': 0.99; 'solid;': 0.99; 'text=#000000': 0.99; 'to<br>':
0.99; 'type="image"': 0.99; 'type="text"': 0.99; 'type=hidden':
0.99; 'type=image': 0.99; 'type=radio': 0.99; 'type=submit': 0.99;
'type=text': 0.99; 'valign=top': 0.99; 'valign=top>': 0.99;
'value="">': 0.99; 'visibility:': 0.99; 'width:': 0.99;
'width="33"': 0.99; 'width=1': 0.99; 'width=100%': 0.99;
'width=100%>': 0.99; 'width=12': 0.99; 'width=125': 0.99;
'width=130': 0.99; 'width=137': 0.99; 'width=2': 0.99; 'width=20':
0.99; 'width=25': 0.99; 'width=4': 0.99; 'width=468': 0.99;
'width=6': 0.99; 'width=72': 0.99; 'works<br>': 0.99
The second f-p had the same structure (and sender :-); the third f-p
had the same structure and a different sender. Ditto the fifth and
sixth. (Not posting clues for brevity.)
The fourth was different: plaintext with one very short sentence and a
URL. Clues:
300 1.00 S 'from:email addr:digicool': 0.01; 'http1:news': 0.24;
'from:email addr:com>': 0.32; 'from:tres': 0.50; 'http>1:1114digi':
0.50; 'proto:http': 0.50; 'subject:Geeks': 0.50; 'x-mailer:mozilla
4.75 [en] (x11; u; linux 2.2.14-5.0smp i686)': 0.50; 'take': 0.54;
'bool:noorg': 0.61; 'http0:com': 0.66; 'skip:h 50': 0.83;
'http>1:htm': 0.90; 'subject:Software': 0.96; 'http>1:business':
0.99; 'http>1:local': 0.99; 'subject:firm': 0.99; 'us:': 0.99
The seventh was similar.
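To see why a handful of 0.99 clues is enough to push the fourth message to 1.00, here is a rough sketch that combines the listed clue probabilities. It assumes Graham-style combining, prod(p) / (prod(p) + prod(1 - p)), over every clue shown; the classifier's exact combining rule and clue selection are not stated in this message, so treat the result as illustrative only.

```python
from math import prod

# Clue probabilities copied from the fourth false positive above,
# in the order listed (tokens omitted for brevity).
clues = [
    0.01, 0.24, 0.32, 0.50, 0.50, 0.50, 0.50, 0.50, 0.54,
    0.61, 0.66, 0.83, 0.90, 0.96, 0.99, 0.99, 0.99, 0.99,
]


def combine(probs):
    """Graham-style combining (an assumption, see the text above).

    Plain products are fine for a short list like this one; a much
    longer list would be safer summed in log space.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


score = combine(clues)
print(f"{score:.2f}")
```

Under this assumption the single 0.01 clue cancels just one of the four 0.99 clues, the 0.50 clues contribute nothing, and the remaining clues above 0.5 outweigh the two below it, which is consistent with the reported score.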
I scanned a bunch more until I got bored, and most of them were either
of the first form (brief text with URL followed by quoted HTML from
website) or the second (brief text with one or more URLs).
It's up to you to decide what to call this, but I think these are none
of your #1, #2 or #3 (they're close to #3, but all are multipart/mixed
rather than multipart/alternative).
> > So I guess I'll have to retrain it (yes, you told me so :-).
>
> That would be a different experiment. I'm certainly curious to see whether