Re: [Ifile-discuss] usage of ifiles threshold option?
From: Paolo
Subject: Re: [Ifile-discuss] usage of ifiles threshold option?
Date: Tue, 8 Mar 2005 14:44:16 +0100
User-agent: Mutt/1.3.28i
On Mon, Mar 07, 2005 at 12:48:13PM +0100, C. Fischer wrote:
> could somebody please give an example of using ifiles `-T' (--threshold)
> option? i want to know how to derive a specific number for it.
hello Clemens,
-T was introduced to allow for a 'grey zone' between the 2 winning
categories (among 2 or more in the database). I.e., in a sense, it makes
1 further bin 'on the fly', into which the test item is thrown whenever
the 2 topmost ranks are closer than the threshold, in relative terms,
according to the formula you get with --help:
R=(r0-r1)/(r0+r1), R*1000 < THRESH
if THRESH > 0.
Actually, you get 2 'grey zones', as you'd get a response like cat1,cat2
or cat2,cat1 according to which rank is absolute max.
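
To make the formula concrete, here is a minimal Python sketch of the decision as described above. It is not ifile's code: the function name `classify` and the dict of ranks are hypothetical, and it assumes positive scores where higher is better. If the ranks are negative (e.g. log-likelihoods), the sign handling needs care and this sketch does not try to reproduce it. Since R is multiplied by 1000, a THRESH of 50 means "the two top ranks differ by less than 5% of their sum".

```python
def classify(ranks, thresh):
    """Return [winner] or, inside the grey zone, [winner, runner_up].

    ranks  -- dict mapping category name to score (assumed positive,
              higher is better; this is an assumption of the sketch)
    thresh -- value as passed to -T; 0 disables the grey zone
    """
    # Sort categories by score, best first.
    ordered = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)
    (c0, r0), (c1, r1) = ordered[0], ordered[1]

    if thresh > 0:
        # R = (r0 - r1) / (r0 + r1), compared as R * 1000 < THRESH
        rel = (r0 - r1) / (r0 + r1)
        if rel * 1000 < thresh:
            # Too close to call: report both, absolute max first.
            return [c0, c1]
    return [c0]


if __name__ == "__main__":
    scores = {"spam": 510.0, "ham": 490.0}   # made-up numbers
    print(classify(scores, 50))   # R*1000 is about 20, inside the grey zone
    print(classify(scores, 10))   # R*1000 is about 20, outside the grey zone
```

With these made-up scores the same message is 'unsure' at THRESH=50 but a plain 'spam' at THRESH=10, which is the trade-off to tune when picking a number.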
In spam filtering, eg, you can do a coarse classification with a large
threshold and less computationally expensive preprocessing, then
reprocess whatever lands in the 'unsure' bin on a 2nd pass, with a
narrower threshold, better preprocessing, MIME decoding etc.
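
The two-pass idea can be sketched like this. Everything here is illustrative: `top_two`, `two_pass`, the scorer callables and the default thresholds 100 and 20 are hypothetical, not ifile options or recommended values, and the scores are again assumed positive with higher meaning better.

```python
def top_two(ranks, thresh):
    """Apply R = (r0 - r1) / (r0 + r1), R * 1000 < THRESH to a dict of ranks."""
    ordered = sorted(ranks, key=ranks.get, reverse=True)
    r0, r1 = ranks[ordered[0]], ranks[ordered[1]]
    unsure = thresh > 0 and (r0 - r1) / (r0 + r1) * 1000 < thresh
    return ordered[:2] if unsure else ordered[:1]


def two_pass(msg, cheap_scorer, costly_scorer, wide=100, narrow=20):
    """Coarse pass first; only 'unsure' messages get the expensive pass.

    cheap_scorer / costly_scorer -- callables taking the message and
    returning a dict of category -> score (stand-ins for running the
    classifier after light vs. heavy preprocessing).
    """
    first = top_two(cheap_scorer(msg), wide)
    if len(first) == 1:
        return first                      # clear verdict, stop here
    # Grey zone: redo with better preprocessing and a narrower threshold.
    return top_two(costly_scorer(msg), narrow)
```

A message that is still returned with two categories after the second pass stays in the 'unsure' bin.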
- In a previous msg you mentioned MIME processing: AFAICT, that's not very
effective WRT spam/ham classification - see reports in other projects, eg
CRM114 (crm114.sf.net) - see there as well for a link to 'normalizemime', a
tool to mangle/sanitize an RFC [2]822 msg in UTF-*.
- For possible algos/how to implement BCR, besides ifile itself and related
papers, see the comments in the crm114 code; you may also want to have a look
at dbacl / L.Breyer sw/site : http://www.lbreyer.com/emailtut.html
hope this helps - if you come up with anything new/interesting pls report
back :)
--
paolo
GPG/PGP id:0x21426690 kfp:EDFB 0103 A8D8 4180 8AB5 D59E 9771 0F28 2142 6690
"Indeed, it does come with warranty: it *will* fail, sometimes, somehow..."
- software vendor